2025-05-29-12-07

Make Planning Research Rigorous Again!

Abstract

arXiv:2505.21674v1 Announce Type: new Abstract: In over sixty years since its inception, the field of planning has made significant contributions to both the theory and practice of building planning software that can solve a never-before-seen planning problem. This was done through established practices of rigorous design and evaluation of planning systems. It is our position that this rigor should be applied to the current trend of work on planning with large language models. One way to do so is by correctly incorporating the insights, tools, and data from the automated planning community into the design and evaluation of LLM-based planners. The experience and expertise of the planning community are not just important from a historical perspective; the lessons learned could play a crucial role in accelerating the development of LLM-based planners. This position is particularly important in light of the abundance of recent works that replicate and propagate the same pitfalls that the planning community has encountered and learned from. We believe that avoiding such known pitfalls will contribute greatly to the progress in building LLM-based planners and to planning in general.

摘要

自规划领域诞生六十余年来,其在构建能够解决全新规划问题的规划软件理论与实践方面做出了重大贡献。这一成就源于对规划系统进行严格设计与评估的既定实践。我们认为,当前基于大语言模型的规划研究热潮同样需要贯彻这种严谨性。实现路径之一是将自动化规划领域的洞见、工具和数据正确整合到基于LLM的规划器设计与评估中。规划界的经验与专业积淀不仅具有历史意义,其积累的教训更能对加速LLM规划器发展起到关键作用。鉴于近期大量研究正在重复规划领域曾遭遇并克服过的相同陷阱,这一立场显得尤为重要。我们相信,规避这些已知陷阱将极大推动基于LLM的规划器发展,并对整个规划领域产生深远影响。


Herd Behavior: Investigating Peer Influence in LLM-based Multi-Agent Systems

Abstract

arXiv:2505.21588v1 Announce Type: new Abstract: Recent advancements in Large Language Models (LLMs) have enabled the emergence of multi-agent systems where LLMs interact, collaborate, and make decisions in shared environments. While individual model behavior has been extensively studied, the dynamics of peer influence in such systems remain underexplored. In this paper, we investigate herd behavior, the tendency of agents to align their outputs with those of their peers, within LLM-based multi-agent interactions. We present a series of controlled experiments that reveal how herd behaviors are shaped by multiple factors. First, we show that the gap between self-confidence and perceived confidence in peers significantly impacts an agent's likelihood to conform. Second, we find that the format in which peer information is presented plays a critical role in modulating the strength of herd behavior. Finally, we demonstrate that the degree of herd behavior can be systematically controlled, and that appropriately calibrated herd tendencies can enhance collaborative outcomes. These findings offer new insights into the social dynamics of LLM-based systems and open pathways for designing more effective and adaptive multi-agent collaboration frameworks.

摘要

大型语言模型(LLM)的最新进展推动了多智能体系统的出现,这些系统中的LLM能够在共享环境中交互、协作并做出决策。尽管单个模型的行为已得到广泛研究,但此类系统中同伴影响的动态机制仍未充分探索。本文研究了基于LLM的多智能体交互中的从众行为——即智能体倾向于使其输出与同伴保持一致的倾向。我们通过一系列受控实验揭示了从众行为如何受多种因素影响:首先,研究表明自我置信度与感知同伴置信度之间的差距显著影响智能体的从众概率;其次,发现同伴信息的呈现形式对调节从众行为强度具有关键作用;最后,我们证明从众程度可被系统调控,且适当校准的从众倾向能提升协作效果。这些发现为基于LLM系统的社会动力学提供了新见解,并为设计更高效、自适应的多智能体协作框架开辟了路径。


Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation

Abstract

arXiv:2505.21880v1 Announce Type: new Abstract: This study presents an innovative approach to urban mobility simulation by integrating a Large Language Model (LLM) with Agent-Based Modeling (ABM). Unlike traditional rule-based ABM, the proposed framework leverages LLM to enhance agent diversity and realism by generating synthetic population profiles, allocating routine and occasional locations, and simulating personalized routes. Using real-world data, the simulation models individual behaviors and large-scale mobility patterns in Taipei City. Key insights, such as route heat maps and mode-specific indicators, provide urban planners with actionable information for policy-making. Future work focuses on establishing robust validation frameworks to ensure accuracy and reliability in urban planning applications.

摘要

本研究提出一种创新性城市移动性模拟方法,通过将大语言模型(LLM)与基于智能体的建模(ABM)相结合。与传统基于规则的ABM不同,该框架利用LLM生成合成人口特征、分配常规与偶发活动地点,并模拟个性化路线,从而增强智能体多样性与真实性。基于台北市真实数据的仿真实验,成功模拟了个体行为与大规模移动模式。关键发现如路线热力图和交通方式专项指标,为城市规划者提供了可操作的决策依据。未来工作将致力于建立稳健的验证框架,以确保城市规划应用中的准确性与可靠性。


Abstract

arXiv:2505.21575v1 Announce Type: new Abstract: Large Language Models (LLMs) have shown remarkable proficiency in natural language understanding (NLU), opening doors for innovative applications. We introduce StreamLink, an LLM-driven distributed data system designed to improve the efficiency and accessibility of data engineering tasks. We build StreamLink on top of distributed frameworks such as Apache Spark and Hadoop to handle large data at scale. One of the important design philosophies of StreamLink is to respect user data privacy by utilizing locally fine-tuned LLMs instead of a public AI service like ChatGPT. With help from domain-adapted LLMs, we can improve our system's understanding of natural language queries from users in various scenarios and simplify the procedure of generating database queries, such as Structured Query Language (SQL) queries, for information processing. We also incorporate LLM-based syntax and security checkers to guarantee the reliability and safety of each generated query. StreamLink illustrates the potential of merging generative LLMs with distributed data processing for comprehensive and user-centric data engineering. With this architecture, users can interact with complex database systems at different scales in a user-friendly and secure manner: SQL generation improves execution accuracy by over 10% compared to baseline methods, and users can find the item of greatest interest among hundreds of millions of items within a few seconds using natural language.

摘要

大型语言模型(LLMs)在自然语言理解(NLU)方面展现出卓越能力,为创新应用开辟了道路。我们提出StreamLink——一个基于LLM的分布式数据系统,旨在提升数据工程任务的效率与可访问性。该系统构建于Apache Spark和Hadoop等分布式框架之上,以支持大规模数据处理。StreamLink的重要设计理念之一是通过采用本地微调的LLMs(而非ChatGPT等公共AI服务)来保障用户数据隐私。借助领域适配的LLMs,我们能够增强系统对多样化场景下用户自然语言查询的理解能力,并简化生成结构化查询语言(SQL)等数据库查询的信息处理流程。系统还集成了基于LLM的语法与安全检查器,确保每个生成查询的可靠性与安全性。StreamLink展现了生成式LLMs与分布式数据处理技术融合的潜力,可实现以用户为中心的全方位数据工程。通过该架构,用户能以友好且安全的方式与不同规模的复杂数据库系统交互:相比基线方法,其SQL生成执行准确率提升超过10%,并支持用户在数秒内从数亿条数据中通过自然语言定位最关注的项目。
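The abstract says every generated query passes through syntax and security checkers before it runs. StreamLink's checkers are LLM-based and run against Spark and Hadoop; the sketch below swaps in a much simpler rule-based check against an in-memory SQLite database, purely to illustrate the gatekeeping step. The function name `check_generated_sql`, the read-only policy, and the example schema are all assumptions, not part of StreamLink.

```python
import sqlite3


def check_generated_sql(sql: str, schema_ddl: str) -> tuple[bool, str]:
    """Return (accepted, reason) for a single generated query.

    Hypothetical policy: accept exactly one read-only SELECT statement that
    compiles against the given schema. This is a rule-based stand-in, not
    the LLM-based checker described in the abstract.
    """
    statement = sql.strip().rstrip(";").strip()

    # Security check (conservative): reject anything that looks like
    # several statements, even if the semicolon sits in a string literal.
    if ";" in statement:
        return False, "multiple statements are not allowed"
    if not statement.lower().startswith("select"):
        return False, "only SELECT statements are allowed"

    # Syntax check: let SQLite compile the statement without running it.
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)
        conn.execute("EXPLAIN " + statement)
    except sqlite3.Error as exc:
        return False, f"syntax or schema error: {exc}"
    finally:
        conn.close()
    return True, "ok"


# Hypothetical schema used only for this illustration.
EXAMPLE_SCHEMA = "CREATE TABLE items (id INTEGER, name TEXT, price REAL);"
```

A caller would run the check on each candidate query and forward only accepted queries to the execution engine; a rejected query could be sent back to the generator together with the reason string.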


Towards Safety Reasoning in LLMs: AI-agentic Deliberation for Policy-embedded CoT Data Creation

Abstract

arXiv:2505.21784v1 Announce Type: new Abstract: Safety reasoning is a recent paradigm where LLMs reason over safety policies before generating responses, thereby mitigating limitations in existing safety measures such as over-refusal and jailbreak vulnerabilities. However, implementing this paradigm is challenging due to the resource-intensive process of creating high-quality policy-embedded chain-of-thought (CoT) datasets while ensuring reasoning remains accurate and free from hallucinations or policy conflicts. To tackle this, we propose AIDSAFE: Agentic Iterative Deliberation for Safety Reasoning, a novel data generation recipe that leverages multi-agent deliberation to iteratively expand reasoning on safety policies. A data refiner stage in AIDSAFE ensures high-quality outputs by eliminating repetitive, redundant, and deceptive thoughts. AIDSAFE-generated CoTs provide a strong foundation for supervised fine-tuning (SFT)-based safety training. Additionally, to address the need of preference data in alignment stages, such as DPO training, we introduce a supplemental recipe that uses belief augmentation to create distinct selected and rejected CoT samples. Our evaluations demonstrate that AIDSAFE-generated CoTs achieve superior policy adherence and reasoning quality. Consequently, we show that fine-tuning open-source LLMs on these CoTs can significantly improve safety generalization and jailbreak robustness while maintaining acceptable utility and over-refusal accuracy. AIDSAFE-generated CoT datasets can be found here: https://huggingface.co/datasets/AmazonScience/AIDSAFE

摘要

安全推理是一种新兴范式,大型语言模型(LLM)通过先对安全策略进行推理再生成响应,从而缓解现有安全措施(如过度拒绝和越狱漏洞)的局限性。然而,由于创建高质量策略嵌入思维链(CoT)数据集需要耗费大量资源,同时还需确保推理的准确性和避免幻觉或策略冲突,该范式的实施面临挑战。为此,我们提出AIDSAFE:安全推理的代理迭代审议方法——一种利用多智能体审议迭代扩展安全策略推理的新型数据生成方案。AIDSAFE中的数据精炼阶段通过消除重复、冗余和欺骗性思维来保证输出质量。AIDSAFE生成的思维链为基于监督微调(SFT)的安全训练提供了坚实基础。此外,针对对齐阶段(如DPO训练)对偏好数据的需求,我们引入了一种补充方案,利用信念增强来创建差异化的选定与拒绝思维链样本。评估表明,AIDSAFE生成的思维链在策略遵循性和推理质量上表现优异。实验证明,基于这些思维链对开源LLM进行微调,可显著提升安全泛化能力和越狱鲁棒性,同时保持可接受的实用性和过度拒绝准确性。AIDSAFE生成的思维链数据集详见:https://huggingface.co/datasets/AmazonScience/AIDSAFE


From Reasoning to Learning: A Survey on Hypothesis Discovery and Rule Learning with Large Language Models

Abstract

arXiv:2505.21935v1 Announce Type: new Abstract: Since the advent of Large Language Models (LLMs), efforts have largely focused on improving their instruction-following and deductive reasoning abilities, leaving open the question of whether these models can truly discover new knowledge. In pursuit of artificial general intelligence (AGI), there is a growing need for models that not only execute commands or retrieve information but also learn, reason, and generate new knowledge by formulating novel hypotheses and theories that deepen our understanding of the world. Guided by Peirce's framework of abduction, deduction, and induction, this survey offers a structured lens to examine LLM-based hypothesis discovery. We synthesize existing work in hypothesis generation, application, and validation, identifying both key achievements and critical gaps. By unifying these threads, we illuminate how LLMs might evolve from mere "information executors" into engines of genuine innovation, potentially transforming research, science, and real-world problem solving.

摘要

自大型语言模型(LLMs)问世以来,研究重点多集中于提升其指令遵循与演绎推理能力,而关于这些模型能否真正发现新知识的问题仍悬而未决。在追求通用人工智能(AGI)的过程中,我们日益需要模型不仅能执行指令或检索信息,更能通过学习、推理和生成新知识来提出深化人类认知的新假设与理论。本文以皮尔士的"溯因-演绎-归纳"框架为指导,为基于LLM的假设发现研究提供结构化视角。我们系统梳理了假设生成、应用与验证领域的现有成果,既总结了关键突破,也指出了核心缺陷。通过整合这些研究方向,本文阐明了LLMs如何可能从单纯的"信息执行者"蜕变为真正创新的引擎,从而潜在变革科学研究与现实问题解决的范式。


Large Language Models for Solving Economic Dispatch Problem

Abstract

arXiv:2505.21931v1 Announce Type: new Abstract: This paper investigates the capability of off-the-shelf large language models (LLMs) to solve the economic dispatch (ED) problem. ED is a hard-constrained optimization problem solved on a day-ahead timescale by grid operators to minimize electricity generation costs while accounting for physical and engineering constraints. Numerous approaches have been proposed, but these typically require either mathematical formulations, face convergence issues, or depend on extensive labeled data and training time. This work implements LLMs enhanced with reasoning capabilities to address the classic lossless ED problem. The proposed approach avoids the need for explicit mathematical formulations, does not suffer from convergence challenges, and requires neither labeled data nor extensive training. A few-shot learning technique is utilized in two different prompting contexts. The IEEE 118-bus system with 19 generation units serves as the evaluation benchmark. Results demonstrate that various prompting strategies enable LLMs to effectively solve the ED problem, offering a convenient and efficient alternative. Consequently, this approach presents a promising future solution for ED tasks, particularly when foundational power system models are available.

摘要

本文研究了现成大型语言模型(LLMs)解决经济调度(ED)问题的能力。ED是电网运营商在日前时间尺度上求解的硬约束优化问题,旨在满足物理和工程约束的同时最小化发电成本。尽管已有多种解决方案,但这些方法通常需要数学公式、面临收敛问题,或依赖大量标注数据和训练时间。本研究采用具备推理能力增强的LLMs来解决经典的无损ED问题,所提方法无需显式数学公式、不存在收敛挑战,且不需要标注数据或大量训练。我们在两种不同的提示场景中应用了小样本学习技术,并以包含19台发电机组的IEEE 118节点系统作为评估基准。结果表明,多种提示策略能使LLMs有效求解ED问题,提供了一种便捷高效的替代方案。因此,该方法为ED任务(特别是在具备电力系统基础模型的情况下)展现出了极具前景的未来解决方案。
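For contrast with the prompting approach, the explicit formulation that conventional methods rely on can be sketched in a few lines. In the lossless ED problem, total cost is minimized subject to the generator outputs summing to demand and each output staying within its limits; with convex quadratic costs the optimum has all unconstrained units running at the same marginal cost. The sketch below finds that marginal cost by bisection. The quadratic cost model, the coefficients, and the three-unit example are assumptions for illustration and are not taken from the paper or from the IEEE 118-bus data.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    a: float      # quadratic cost coefficient, cost = a*P^2 + b*P (assumed model)
    b: float      # linear cost coefficient
    p_min: float  # minimum output (MW)
    p_max: float  # maximum output (MW)


def output_at(unit: Unit, lam: float) -> float:
    """Output at which the unit's marginal cost 2*a*P + b equals lam,
    clipped to the unit's operating limits."""
    p = (lam - unit.b) / (2 * unit.a)
    return min(max(p, unit.p_min), unit.p_max)


def dispatch(units: list[Unit], demand: float, tol: float = 1e-9) -> list[float]:
    """Lossless economic dispatch by bisection on the marginal cost lam."""
    if not sum(u.p_min for u in units) <= demand <= sum(u.p_max for u in units):
        raise ValueError("demand is outside the combined generator limits")
    # Total output is non-decreasing in lam, so bisection brackets the answer.
    lo = min(2 * u.a * u.p_min + u.b for u in units)
    hi = max(2 * u.a * u.p_max + u.b for u in units)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if sum(output_at(u, mid) for u in units) < demand:
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2
    return [output_at(u, lam) for u in units]


# Hypothetical three-unit system, for illustration only.
example_units = [
    Unit(a=0.010, b=8.0, p_min=50.0, p_max=300.0),
    Unit(a=0.012, b=7.5, p_min=40.0, p_max=250.0),
    Unit(a=0.008, b=9.0, p_min=60.0, p_max=350.0),
]
example_schedule = dispatch(example_units, demand=600.0)
```

A solver of this kind could also serve as a reference when judging whether an LLM-produced schedule meets the demand balance and the unit limits.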


AI-Supported Platform for System Monitoring and Decision-Making in Nuclear Waste Management with Large Language Models

Abstract

arXiv:2505.21741v1 Announce Type: new Abstract: Nuclear waste management requires rigorous regulatory compliance assessment, demanding advanced decision-support systems capable of addressing complex legal, environmental, and safety considerations. This paper presents a multi-agent Retrieval-Augmented Generation (RAG) system that integrates large language models (LLMs) with document retrieval mechanisms to enhance decision accuracy through structured agent collaboration. Through a structured 10-round discussion model, agents collaborate to assess regulatory compliance and safety requirements while maintaining document-grounded responses. Implemented on consumer-grade hardware, the system leverages Llama 3.2 and mxbai-embed-large-v1 embeddings for efficient retrieval and semantic representation. A case study of a proposed temporary nuclear waste storage site near Winslow, Arizona, demonstrates the framework's effectiveness. Results show the Regulatory Agent achieves consistently higher relevance scores in maintaining alignment with legal frameworks, while the Safety Agent effectively manages complex risk assessments requiring multifaceted analysis. The system demonstrates progressive improvement in agreement rates between agents across discussion rounds while semantic drift decreases, indicating enhanced decision-making consistency and response coherence. The system ensures regulatory decisions remain factually grounded, dynamically adapting to evolving regulatory frameworks through real-time document retrieval. By balancing automated assessment with human oversight, this framework offers a scalable and transparent approach to regulatory governance. These findings underscore the potential of AI-driven, multi-agent systems in advancing evidence-based, accountable, and adaptive decision-making for high-stakes environmental management scenarios.

摘要

核废料管理需要严格的法规遵从性评估,这要求决策支持系统能够处理复杂的法律、环境和安全因素。本文提出一种多智能体检索增强生成(RAG)系统,通过整合大语言模型(LLMs)与文档检索机制,以结构化智能体协作提升决策准确性。系统采用10轮结构化讨论模型,各智能体协作评估法规合规性与安全要求,同时保持基于文档的响应。在消费级硬件上实现时,该系统利用Llama 3.2和mxbai-embed-large-v1嵌入模型实现高效检索与语义表征。以亚利桑那州温斯洛附近拟建临时核废料储存场为例的案例研究验证了该框架的有效性。结果表明:法规智能体在保持法律框架一致性方面持续获得更高相关性评分,而安全智能体能有效处理需多维度分析的复杂风险评估。随着讨论轮次增加,智能体间共识率逐步提升且语义漂移降低,表明决策一致性与响应连贯性增强。该系统通过实时文档检索动态适应不断演变的法规框架,确保监管决策始终基于事实。通过平衡自动化评估与人工监督,该框架为监管治理提供了可扩展且透明的解决方案。这些发现凸显了人工智能驱动的多智能体系统在推进高风险环境管理场景中循证、可问责且适应性决策方面的潜力。


Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models

Abstract

arXiv:2505.21765v1 Announce Type: new Abstract: While the recent success of large reasoning models (LRMs) has significantly advanced LLMs' reasoning capability by optimizing final answer accuracy using reinforcement learning, these models may also drastically increase output length due to overthinking, characterized by unnecessarily complex reasoning paths that waste computation and potentially degrade performance. We hypothesize that such inefficiencies stem from LRMs' limited capability to dynamically select the proper modular reasoning strategies, termed thinking patterns, at the right position. To investigate this hypothesis, we propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns, systematically identifying and promoting beneficial patterns that improve the answer while removing detrimental ones. Empirical analysis confirms that our optimized thinking paths yield more concise yet sufficiently informative trajectories, enhancing reasoning efficiency by reducing attention FLOPs by up to 47% while maintaining accuracy for originally correct responses. Moreover, a non-trivial portion of originally incorrect responses are transformed into correct ones, achieving a 15.6% accuracy improvement with reduced length. Motivated by the improvement brought by the optimized thinking paths, we apply a preference optimization technique supported by a pairwise dataset contrasting suboptimal and optimal reasoning paths. Experimental evaluations across multiple mathematical reasoning benchmarks reveal that our method notably reduces computational overhead while simultaneously improving reasoning accuracy, achieving up to a 12% accuracy improvement and reducing token usage from approximately 5,000 to 3,000 tokens.

摘要

尽管大型推理模型(LRMs)近期通过强化学习优化最终答案准确率显著提升了大型语言模型(LLMs)的推理能力,但其可能因过度思考而大幅增加输出长度——这种特征表现为不必要的复杂推理路径,既浪费计算资源又可能导致性能下降。我们假设这种低效性源于LRMs动态选择适当模块化推理策略(称为"思维模式")的能力不足。为验证该假设,我们提出一个动态优化框架:将模型生成的推理路径分割为不同思维模式,系统性地识别并提升有益模式以改进答案,同时剔除有害模式。实证分析表明,优化后的思维路径能产生更简洁且信息充分的轨迹,在保持原有正确答案准确率的同时,将注意力浮点运算量(FLOPs)降低达47%。此外,相当比例原本错误的答案被转化为正确结果,在缩短输出长度的同时实现了15.6%的准确率提升。基于优化思维路径带来的改进,我们采用偏好优化技术,通过对比次优与最优推理路径的配对数据集进行训练。在多个数学推理基准测试中,实验评估表明该方法显著降低了计算开销,同时提升推理准确率——最高实现12%的准确率提升,并将令牌使用量从约5,000个减少至3,000个。


Query, Don't Train: Privacy-Preserving Tabular Prediction from EHR Data via SQL Queries

Abstract

arXiv:2505.21801v1 Announce Type: new Abstract: Electronic health records (EHRs) contain richly structured, longitudinal data essential for predictive modeling, yet stringent privacy regulations (e.g., HIPAA, GDPR) often restrict access to individual-level records. We introduce Query, Don't Train (QDT): a structured-data foundation-model interface enabling tabular inference via LLM-generated SQL over EHRs. Instead of training on or accessing individual-level examples, QDT uses a large language model (LLM) as a schema-aware query planner to generate privacy-compliant SQL queries from a natural language task description and a test-time input. The model then extracts summary-level population statistics through these SQL queries, and the LLM performs chain-of-thought reasoning over the results to make predictions. This inference-time-only approach (1) eliminates the need for supervised model training or direct data access, (2) ensures interpretability through symbolic, auditable queries, (3) naturally handles missing features without imputation or preprocessing, and (4) effectively manages high-dimensional numerical data to enhance analytical capabilities. We validate QDT on the task of 30-day hospital readmission prediction for Type 2 diabetes patients using a MIMIC-style EHR cohort, achieving F1 = 0.70, which outperforms TabPFN (F1 = 0.68). To our knowledge, this is the first demonstration of LLM-driven, privacy-preserving structured prediction using only schema metadata and aggregate statistics, offering a scalable, interpretable, and regulation-compliant alternative to conventional foundation-model pipelines.

摘要

电子健康记录(EHRs)包含丰富且结构化的纵向数据,这对预测建模至关重要,但严格的隐私法规(如HIPAA、GDPR)通常限制对个体记录的访问。我们提出"查询而非训练"(QDT)方法:这是一种结构化数据基础模型接口,通过基于EHRs的LLM生成SQL实现表格推理。QDT无需在个体样本上训练或访问原始数据,而是利用大语言模型(LLM)作为模式感知的查询规划器,根据自然语言任务描述和测试时输入生成符合隐私要求的SQL查询。模型随后通过这些SQL查询提取汇总级群体统计量,并由LLM对结果进行思维链推理以生成预测。这种仅需推理时介入的方法具有以下优势:(1)无需监督模型训练或直接数据访问;(2)通过可审计的符号化查询确保可解释性;(3)天然处理缺失特征而无需插补或预处理;(4)有效管理高维数值数据以增强分析能力。我们在2型糖尿病患者30天再入院预测任务上验证QDT(使用MIMIC式EHR队列),取得F1=0.70,优于TabPFN(F1=0.68)。据我们所知,这是首个仅利用模式元数据和聚合统计量实现LLM驱动的隐私保护结构化预测的方案——为传统基础模型流程提供了可扩展、可解释且合规的替代方案。
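The core idea in the abstract, answering a prediction task from aggregate statistics instead of individual rows, can be illustrated with a small SQL example. In QDT the query is generated by an LLM from the schema and the test-time input; in the sketch below a hand-written query stands in for that step. The table layout, the column names, the toy rows, and the minimum-cohort-size rule are all assumptions made for illustration and do not come from the paper.

```python
import sqlite3

# Hypothetical toy cohort: (patient_id, age, prior_admissions, readmitted_30d).
TOY_ROWS = [
    (1, 64, 1, 1), (2, 70, 3, 1), (3, 58, 0, 0), (4, 66, 2, 1),
    (5, 72, 2, 0), (6, 45, 0, 0), (7, 61, 1, 0), (8, 68, 4, 1),
]


def build_cohort() -> sqlite3.Connection:
    """Create an in-memory table standing in for an EHR database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE admissions ("
        "patient_id INTEGER, age INTEGER, "
        "prior_admissions INTEGER, readmitted_30d INTEGER)"
    )
    conn.executemany("INSERT INTO admissions VALUES (?, ?, ?, ?)", TOY_ROWS)
    return conn


def readmission_rate(conn, age, prior_admissions, age_window=10, min_cohort=3):
    """Return (cohort size, readmission rate) for patients similar to the
    test-time input, or None if the cohort is too small to report.

    Only aggregates leave the database; no individual row is returned.
    The min_cohort suppression rule is an assumed privacy safeguard.
    """
    query = (
        "SELECT COUNT(*), AVG(readmitted_30d) FROM admissions "
        "WHERE age BETWEEN ? AND ? AND prior_admissions >= ?"
    )
    n, rate = conn.execute(
        query, (age - age_window, age + age_window, prior_admissions)
    ).fetchone()
    if n < min_cohort:
        return None
    return n, rate
```

In a pipeline of the kind the abstract describes, the returned aggregate would be passed to the LLM as evidence for its chain-of-thought prediction, and the query text itself would remain available for audit.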


R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning

Abstract

arXiv:2505.21668v1 Announce Type: new Abstract: Despite advances in reasoning and planning of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, in which textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to effectively leverage code and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder due to high task diversity and expensive code execution, highlighting the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), with the emergent self-checking behavior via code generation. Datasets, Codes, and Models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.

摘要

尽管类R1模型在推理与规划方面取得进展,大型语言模型(LLMs)在需要精确计算、符号操作、优化和算法推理的任务中仍存在困难——这些场景下文本推理缺乏代码执行的严谨性。关键挑战在于如何让LLMs自主判断何时采用文本推理或代码生成。虽然OpenAI通过训练实现按需调用代码解释器,但公开研究缺乏关于如何对齐预训练LLMs以有效利用代码并泛化至多样任务的指导。我们提出R1-Code-Interpreter,通过对纯文本LLM进行多轮监督微调(SFT)和强化学习(RL)训练,使其能在逐步推理过程中自主生成多个代码查询。我们构建了144个推理与规划任务(107训练/37测试),每个任务包含200+多样化问题。采用不同SFT与RL策略对Qwen-2.5模型(3B/7B/14B)进行微调,研究包括:答案格式差异、推理与非推理模型对比、冷启动与热启动、GRPO与PPO算法比较,以及代码输出的掩码策略。与先前针对狭窄领域的RL研究不同,我们发现代码解释器训练因任务多样性和高昂的代码执行成本而显著困难,这凸显了SFT阶段的关键作用。最终模型R1-CI-14B将37项测试任务的平均准确率从44.0%提升至64.1%,超越GPT-4o纯文本模式(58.6%),并接近启用代码解释器的GPT-4o(70.9%),且通过代码生成展现出新兴的自检行为。数据集、代码与模型已开源:https://github.com/yongchao98/R1-Code-Interpreter 和 https://huggingface.co/yongchao98 。


Efficiently Enhancing General Agents With Hierarchical-categorical Memory

Abstract

arXiv:2505.22006v1 Announce Type: new Abstract: With large language models (LLMs) demonstrating remarkable capabilities, there has been a surge in research on leveraging LLMs to build general-purpose multi-modal agents. However, existing approaches either rely on computationally expensive end-to-end training using large-scale multi-modal data or adopt tool-use methods that lack the ability to continuously learn and adapt to new environments. In this paper, we introduce EHC, a general agent capable of learning without parameter updates. EHC consists of a Hierarchical Memory Retrieval (HMR) module and a Task-Category Oriented Experience Learning (TOEL) module. The HMR module facilitates rapid retrieval of relevant memories and continuously stores new information without being constrained by memory capacity. The TOEL module enhances the agent's comprehension of various task characteristics by classifying experiences and extracting patterns across different categories. Extensive experiments conducted on multiple standard datasets demonstrate that EHC outperforms existing methods, achieving state-of-the-art performance and underscoring its effectiveness as a general agent for handling complex multi-modal tasks.

摘要

随着大语言模型(LLM)展现出卓越的能力,利用LLM构建通用多模态代理的研究呈现爆发式增长。然而,现有方法要么依赖基于大规模多模态数据的高计算成本端到端训练,要么采用缺乏持续学习与环境适应能力的工具使用方法。本文提出EHC——一种无需参数更新的通用学习代理,其核心由层次化记忆检索(HMR)模块和任务导向型经验学习(TOEL)模块构成。HMR模块通过高效检索相关记忆并突破存储容量限制持续更新信息;TOEL模块通过经验分类与跨类别模式提取,增强代理对不同任务特性的理解能力。在多个标准数据集上的实验表明,EHC以显著优势超越现有方法,其处理复杂多模态任务的性能达到当前最优水平,充分验证了作为通用代理的有效性。


SAGE-Eval: Evaluating LLMs for Systematic Generalizations of Safety Facts

Abstract

arXiv:2505.21828v1 Announce Type: new Abstract: Do LLMs robustly generalize critical safety facts to novel situations? Lacking this ability is dangerous when users ask naive questions. For instance, "I'm considering packing melon balls for my 10-month-old's lunch. What other foods would be good to include?" Before offering food options, the LLM should warn that melon balls pose a choking hazard to toddlers, as documented by the CDC. Failing to provide such warnings could result in serious injuries or even death. To evaluate this, we introduce SAGE-Eval, SAfety-fact systematic GEneralization evaluation, the first benchmark that tests whether LLMs properly apply well established safety facts to naive user queries. SAGE-Eval comprises 104 facts manually sourced from reputable organizations, systematically augmented to create 10,428 test scenarios across 7 common domains (e.g., Outdoor Activities, Medicine). We find that the top model, Claude-3.7-sonnet, passes only 58% of all the safety facts tested. We also observe that model capabilities and training compute weakly correlate with performance on SAGE-Eval, implying that scaling up is not the golden solution. Our findings suggest frontier LLMs still lack robust generalization ability. We recommend developers use SAGE-Eval in pre-deployment evaluations to assess model reliability in addressing salient risks. We publicly release SAGE-Eval at https://huggingface.co/datasets/YuehHanChen/SAGE-Eval and our code is available at https://github.com/YuehHanChen/SAGE-Eval/tree/main.

摘要

大语言模型(LLM)能否将关键安全知识稳健地推广至新情境?当用户提出天真问题时,缺乏这种能力是危险的。例如"我打算为10个月大的宝宝午餐准备蜜瓜球,还应该搭配哪些食物?"在推荐食物前,LLM应依据美国疾控中心(CDC)记录,警告蜜瓜球可能造成幼儿窒息风险。若未能提供此类警告,可能导致严重伤害甚至死亡。为此,我们提出SAGE-Eval(安全知识系统化泛化评估),首个评估LLM能否将公认安全知识正确应用于天真用户提问的基准。该基准包含从权威机构手动收集的104项安全知识,经系统化扩展形成7大常见领域(如户外活动、医药)共10,428个测试场景。研究发现,表现最佳的Claude-3.7-sonnet模型仅通过58%的安全知识测试。同时观察到模型能力与训练算力仅与SAGE-Eval表现呈弱相关性,表明单纯扩大规模并非最佳解决方案。研究结果表明前沿LLM仍缺乏稳健的泛化能力。建议开发者在部署前使用SAGE-Eval评估模型应对突出风险的可靠性。我们已在https://huggingface.co/datasets/YuehHanChen/SAGE-Eval 公开SAGE-Eval数据集,代码发布于https://github.com/YuehHanChen/SAGE-Eval/tree/main。


VIRAL: Vision-grounded Integration for Reward design And Learning

Abstract

arXiv:2505.22092v1 Announce Type: new Abstract: The alignment between humans and machines is a critical challenge in artificial intelligence today. Reinforcement learning, which aims to maximize a reward function, is particularly vulnerable to the risks associated with poorly designed reward functions. Recent advancements have shown that using Large Language Models (LLMs) for reward generation can outperform humans in this context. We introduce VIRAL, a pipeline for generating and refining reward functions through the use of multi-modal LLMs. VIRAL autonomously creates and interactively improves reward functions based on a given environment and a goal prompt or annotated image. The refinement process can incorporate human feedback or be guided by a description generated by a video LLM, which explains the agent's policy in video form. We evaluated VIRAL in five Gymnasium environments, demonstrating that it accelerates the learning of new behaviors while ensuring improved alignment with user intent. The source code and demo video are available at: https://github.com/VIRAL-UCBL1/VIRAL and https://youtu.be/t4_BXugBm9Q.

摘要

人机对齐是当前人工智能领域的关键挑战。以奖励函数最大化为目标的强化学习方法,尤其容易受到设计不当的奖励函数所带来的风险影响。最新研究表明,基于大语言模型(LLMs)的奖励生成在此背景下可超越人类表现。本文提出VIRAL——一种通过多模态大语言模型生成与优化奖励函数的流程框架。该系统能基于给定环境及目标提示(或标注图像)自主创建并通过交互式迭代改进奖励函数。优化过程既可融入人类反馈,也可由视频大语言模型生成的策略描述(以视频形式呈现智能体行为)来指导实现。我们在五个Gymnasium环境中对VIRAL进行了评估,结果表明其不仅能加速新行为的学习,还能确保更精准地符合用户意图。源代码及演示视频详见:https://github.com/VIRAL-UCBL1/VIRAL 和 https://youtu.be/t4_BXugBm9Q 。


Modeling and Optimizing User Preferences in AI Copilots: A Comprehensive Survey and Taxonomy

Abstract

arXiv:2505.21907v1 Announce Type: new Abstract: AI copilots, context-aware, AI-powered systems designed to assist users in tasks such as software development and content creation, are becoming integral to modern workflows. As these systems grow in capability and adoption, personalization has emerged as a cornerstone for ensuring usability, trust, and productivity. Central to this personalization is preference optimization: the ability of AI copilots to detect, interpret, and align with individual user preferences. While personalization techniques are well-established in domains like recommender systems and dialogue agents, their adaptation to interactive, real-time systems like AI copilots remains fragmented and underexplored. This survey addresses this gap by synthesizing research on how user preferences are captured, modeled, and refined within the design of AI copilots. We introduce a unified definition of AI copilots and propose a phase-based taxonomy of preference optimization strategies, structured around pre-interaction, mid-interaction, and post-interaction stages. We analyze techniques for acquiring preference signals, modeling user intent, and integrating feedback loops, highlighting both established approaches and recent innovations. By bridging insights from AI personalization, human-AI collaboration, and large language model adaptation, this survey provides a structured foundation for designing adaptive, preference-aware AI copilots. It offers a holistic view of the available preference resources, how they can be leveraged, and which technical approaches are most suited to each stage of system design.

摘要

AI协作者(AI copilots)作为情境感知、人工智能驱动的辅助系统,旨在帮助用户完成软件开发与内容创作等任务,正逐渐成为现代工作流程的核心组成部分。随着系统能力与应用范围的扩展,个性化已成为确保可用性、信任度与生产力的关键要素。其中偏好优化是个人化的核心环节,即AI协作者检测、解读并适应用户个体偏好的能力。尽管个性化技术在推荐系统与对话代理等领域已趋成熟,但其在AI协作者这类交互式实时系统中的适配研究仍呈现碎片化且探索不足的现状。本综述通过系统梳理AI协作者设计中用户偏好的捕获、建模与优化研究,填补了这一空白。我们提出了AI协作者的统一定义,并构建了基于交互前、交互中与交互后三阶段的偏好优化策略分类体系。通过分析偏好信号获取、用户意图建模及反馈循环整合的技术路径,既梳理了成熟方法,也突出了前沿创新。本研究融合了AI个性化、人机协作与大语言模型适配等领域的洞见,为设计具有自适应性与偏好感知能力的AI协作者提供了结构化理论基础,全面阐述了现有偏好资源的利用方式及其在系统设计各阶段的最适配技术方案。


Reinforced Reasoning for Embodied Planning

Abstract

arXiv:2505.22050v1 Announce Type: new Abstract: Embodied planning requires agents to make coherent multi-step decisions based on dynamic visual observations and natural language goals. While recent vision-language models (VLMs) excel at static perception tasks, they struggle with the temporal reasoning, spatial understanding, and commonsense grounding needed for planning in interactive environments. In this work, we introduce a reinforcement fine-tuning framework that brings R1-style reasoning enhancement into embodied planning. We first distill a high-quality dataset from a powerful closed-source model and perform supervised fine-tuning (SFT) to equip the model with structured decision-making priors. We then design a rule-based reward function tailored to multi-step action quality and optimize the policy via Generalized Reinforced Preference Optimization (GRPO). Our approach is evaluated on Embench, a recent benchmark for interactive embodied tasks, covering both in-domain and out-of-domain scenarios. Experimental results show that our method significantly outperforms models of similar or larger scale, including GPT-4o-mini and 70B+ open-source baselines, and exhibits strong generalization to unseen environments. This work highlights the potential of reinforcement-driven reasoning to advance long-horizon planning in embodied AI.

摘要

具身规划要求智能体基于动态视觉观察和自然语言目标做出连贯的多步决策。尽管当前视觉语言模型(VLMs)在静态感知任务中表现出色,但其在交互环境中进行规划所需的时间推理、空间理解和常识基础方面仍存在不足。本研究提出一种强化微调框架,将R1式推理增强引入具身规划。我们首先从强大的闭源模型中蒸馏出高质量数据集,并通过监督微调(SFT)赋予模型结构化决策先验。随后设计基于规则的多步动作质量奖励函数,采用广义强化偏好优化(GRPO)进行策略优化。该方法在交互式具身任务新基准Embench上进行评估,涵盖领域内和跨领域场景。实验结果表明,我们的方法显著优于规模相近或更大的模型(包括GPT-4o-mini和70B+开源基线),并对未见环境展现出强大泛化能力。本工作揭示了强化驱动推理在推进具身AI长程规划方面的潜力。


Visual Large Language Models Exhibit Human-Level Cognitive Flexibility in the Wisconsin Card Sorting Test

Abstract

arXiv:2505.22112v1 Announce Type: new Abstract: Cognitive flexibility has been extensively studied in human cognition but remains relatively unexplored in the context of Visual Large Language Models (VLLMs). This study assesses the cognitive flexibility of state-of-the-art VLLMs (GPT-4o, Gemini-1.5 Pro, and Claude-3.5 Sonnet) using the Wisconsin Card Sorting Test (WCST), a classic measure of set-shifting ability. Our results reveal that VLLMs achieve or surpass human-level set-shifting capabilities under chain-of-thought prompting with text-based inputs. However, their abilities are highly influenced by both input modality and prompting strategy. In addition, we find that through role-playing, VLLMs can simulate various functional deficits aligned with patients having impairments in cognitive flexibility, suggesting that VLLMs may possess a cognitive architecture, at least regarding the ability of set-shifting, similar to the brain. This study reveals the fact that VLLMs have already approached the human level on a key component underlying our higher cognition, and highlights the potential to use them to emulate complex brain processes.

摘要

认知灵活性在人类认知领域已得到广泛研究,但在视觉大语言模型(VLLMs)中的探索仍相对不足。本研究采用威斯康星卡片分类测试(WCST)——这一衡量定势转换能力的经典范式,对前沿VLLMs(GPT-4o、Gemini-1.5 Pro和Claude-3.5 Sonnet)的认知灵活性进行评估。结果表明,在思维链提示的文本输入条件下,VLLMs能够达到或超越人类水平的定势转换能力,但其表现显著受输入模态和提示策略的影响。此外,研究发现通过角色扮演,VLLMs可模拟与认知灵活性受损患者相符的多种功能性缺陷,这表明VLLMs可能具有至少就定势转换能力而言与大脑相似的认知架构。本研究揭示了VLLMs在人类高阶认知关键组成部分上已接近人类水平的事实,并凸显了其模拟复杂大脑过程的潜在价值。


Efficient Leave-one-out Approximation in LLM Multi-agent Debate Based on Introspection

Abstract

arXiv:2505.22192v1 Announce Type: new Abstract: Multi-agent systems based on large language models (LLMs) advance automatic task completion in various fields, where debate is a common form of cooperation in which agents solve complicated problems through reasoning and cross-review to solidify answers. Assessing the individual contributions of agents within these debates is crucial for system refinement and outcome reliability. The traditional leave-one-out (LOO) method offers a clear framework for evaluating each agent's role but faces challenges in LLM-based systems due to its high computational and financial costs. This paper presents introspective-leave-one-out (IntrospecLOO), a simple yet effective prompting method for approximating LOO in LLM-powered multi-agent debates. IntrospecLOO introduces an additional querying round after standard debates, prompting agents to update their answers while ignoring responses from a designated agent. This strategy effectively isolates and gauges each participant's influence at a reduced query complexity compared to the original LOO approaches. Experiments on three benchmark datasets confirm the effectiveness of IntrospecLOO.

摘要

基于大语言模型(LLM)的多智能体系统推动了各领域自动任务完成的进展,其中辩论是智能体通过推理和交叉评审来解决复杂问题并巩固答案的常见协作形式。评估这些辩论中每个智能体的个体贡献对于系统优化和结果可靠性至关重要。传统的留一法(LOO)为评估各智能体作用提供了清晰框架,但在基于LLM的系统中面临高计算成本和相应财务影响等挑战。本文提出内省留一法(IntrospecLOO),这是一种简单而有效的提示方法,用于近似计算LLM驱动的多智能体辩论中的LOO。IntrospecLOO在标准辩论后引入额外查询轮次,提示智能体在忽略指定智能体响应的情况下更新答案。与原始LOO方法相比,该策略以更低查询复杂度有效隔离并量化了每个参与者的影响。通过在三个基准数据集上的实验验证,证实了IntrospecLOO的有效性。
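The cost argument in this abstract can be made concrete with a small sketch. Everything below is an illustration under stated assumptions, not the paper's implementation: the query-count model assumes every agent is queried once per round and that exact LOO reruns the whole debate once per excluded agent, and the names `introspective_prompt` and `loo_query_counts`, as well as the prompt wording, are hypothetical.

```python
def introspective_prompt(question: str, transcript: dict[str, str], ignored: str) -> str:
    """Build the extra-round prompt (hypothetical wording).

    The agent sees the finished debate transcript and is asked to
    answer again as if the designated agent had never spoken.
    """
    history = "\n".join(f"{name}: {answer}" for name, answer in transcript.items())
    return (
        f"Question: {question}\n"
        f"Debate so far:\n{history}\n"
        f"Disregard everything said by {ignored} and give your updated answer."
    )


def loo_query_counts(n_agents: int, n_rounds: int) -> tuple[int, int]:
    """Extra LLM queries beyond the base debate, under the assumed cost model.

    Exact LOO: rerun the debate without each agent in turn, so
    n_agents reruns of (n_agents - 1) agents for n_rounds rounds.
    Introspective variant: one extra round per ignored agent,
    answered by the remaining (n_agents - 1) agents.
    """
    exact = n_agents * (n_agents - 1) * n_rounds
    introspective = n_agents * (n_agents - 1)
    return exact, introspective


if __name__ == "__main__":
    transcript = {"agent_a": "42", "agent_b": "41", "agent_c": "42"}
    print(introspective_prompt("What is 6 * 7?", transcript, ignored="agent_b"))
    print(loo_query_counts(n_agents=4, n_rounds=3))
```

Under this assumed model the saving grows linearly with the number of debate rounds, since the introspective variant adds a single round per ignored agent regardless of how long the original debate ran.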


What Makes a Good Reasoning Chain? Uncovering Structural Patterns in Long Chain-of-Thought Reasoning

Abstract

arXiv:2505.22148v1 Announce Type: new Abstract: Recent advances in reasoning with large language models (LLMs) have popularized Long Chain-of-Thought (LCoT), a strategy that encourages deliberate and step-by-step reasoning before producing a final answer. While LCoTs have enabled expert-level performance in complex tasks, how the internal structures of their reasoning chains drive, or even predict, the correctness of final answers remains a critical yet underexplored question. In this work, we present LCoT2Tree, an automated framework that converts sequential LCoTs into hierarchical tree structures and thus enables deeper structural analysis of LLM reasoning. Using graph neural networks (GNNs), we reveal that structural patterns extracted by LCoT2Tree, including exploration, backtracking, and verification, serve as stronger predictors of final performance across a wide range of tasks and models. Leveraging an explainability technique, we further identify critical thought patterns such as over-branching that account for failures. Beyond diagnostic insights, the structural patterns by LCoT2Tree support practical applications, including improving Best-of-N decoding effectiveness. Overall, our results underscore the critical role of internal structures of reasoning chains, positioning LCoT2Tree as a powerful tool for diagnosing, interpreting, and improving reasoning in LLMs.

摘要

大语言模型(LLM)推理领域的最新进展推动了长思维链(LCoT)策略的普及,该策略鼓励在生成最终答案前进行逐步深思熟虑的推理。尽管LCoT已在复杂任务中实现专家级性能,但其推理链的内部结构如何驱动甚至预测最终答案的正确性,仍是一个关键但尚未充分探索的问题。本研究提出LCoT2Tree自动化框架,将序列化LCoT转换为层次化树结构,从而支持对LLM推理进行更深层次的结构分析。通过图神经网络(GNN),我们发现LCoT2Tree提取的结构模式(包括探索、回溯和验证)在多种任务和模型中能更有效地预测最终性能。借助可解释性技术,我们进一步识别出导致失败的临界思维模式(如过度分支)。除诊断价值外,LCoT2Tree揭示的结构模式还支持实际应用,包括提升N选优解码效率。总体而言,我们的研究结果凸显了推理链内部结构的关键作用,使LCoT2Tree成为诊断、解释和改进LLM推理的强大工具。


ChatPD: An LLM-driven Paper-Dataset Networking System

Abstract

arXiv:2505.22349v1 Announce Type: new Abstract: Scientific research heavily depends on suitable datasets for method validation, but existing academic platforms with dataset management like PapersWithCode suffer from inefficiencies in their manual workflow. To overcome this bottleneck, we present a system, called ChatPD, that utilizes Large Language Models (LLMs) to automate dataset information extraction from academic papers and construct a structured paper-dataset network. Our system consists of three key modules: paper collection, dataset information extraction, and dataset entity resolution to construct paper-dataset networks. Specifically, we propose a Graph Completion and Inference strategy to map dataset descriptions to their corresponding entities. Through extensive experiments, we demonstrate that ChatPD not only outperforms the existing platform PapersWithCode in dataset usage extraction but also achieves about 90% precision and recall in entity resolution tasks. Moreover, we have deployed ChatPD to continuously extract which datasets are used in papers, and provide a dataset discovery service, such as task-specific dataset queries and similar dataset recommendations. We open source ChatPD and the current paper-dataset network at https://github.com/ChatPD-web/ChatPD.


AgentDNS: A Root Domain Naming System for LLM Agents

Abstract

arXiv:2505.22368v1 Announce Type: new Abstract: The rapid evolution of Large Language Model (LLM) agents has highlighted critical challenges in cross-vendor service discovery, interoperability, and communication. Existing protocols like model context protocol and agent-to-agent protocol have made significant strides in standardizing interoperability between agents and tools, as well as communication among multi-agents. However, there remains a lack of standardized protocols and solutions for service discovery across different agent and tool vendors. In this paper, we propose AgentDNS, a root domain naming and service discovery system designed to enable LLM agents to autonomously discover, resolve, and securely invoke third-party agent and tool services across organizational and technological boundaries. Inspired by the principles of the traditional DNS, AgentDNS introduces a structured mechanism for service registration, semantic service discovery, secure invocation, and unified billing. We detail the architecture, core functionalities, and use cases of AgentDNS, demonstrating its potential to streamline multi-agent collaboration in real-world scenarios. The source code will be published on https://github.com/agentdns.

摘要

大型语言模型(LLM)代理的快速发展凸显了跨厂商服务发现、互操作性与通信方面的关键挑战。现有协议如模型上下文协议和代理间协议在标准化代理与工具间的互操作性以及多代理通信方面取得了显著进展。然而,针对不同代理和工具厂商之间的服务发现,目前仍缺乏标准化协议与解决方案。本文提出AgentDNS,这是一个根域名命名与服务发现系统,旨在使LLM代理能够跨组织与技术边界自主发现、解析并安全调用第三方代理与工具服务。受传统DNS原理启发,AgentDNS引入了结构化服务注册机制、语义化服务发现、安全调用及统一计费方案。我们详细阐述了AgentDNS的架构、核心功能及应用场景,证明其在实际场景中优化多代理协作的潜力。源代码将发布于https://github.com/agentdns。


Rethinking the Unsolvable: When In-Context Search Meets Test-Time Scaling

Abstract

arXiv:2505.22290v1 Announce Type: new Abstract: Recent research has highlighted that Large Language Models (LLMs), even when trained to generate extended long reasoning steps, still face significant challenges on hard reasoning problems. However, much of the existing literature relies on direct prompting with simple in-context learning examples for evaluation, which largely overlooks advanced techniques to elicit LLMs' deliberate reasoning before drawing conclusions that LLMs hit a performance ceiling. In this paper, we systematically explore the combined potential of in-context search and test-time scaling on super hard reasoning tasks. We find that by employing advanced in-context search prompting to LLMs augmented with internal scaling, one can achieve transformative performance breakthroughs on tasks previously deemed "unsolvable" (e.g., reported success rates below 5%). We provide both empirical results and theoretical analysis of how this combination can unleash LLM reasoning capabilities: i) Empirically, on controlled NP-hard tasks and complex real-world planning benchmarks, our approach achieves up to a 30x improvement in success rates compared to previously reported results without any external mechanisms; ii) Theoretically, we show that in-context search prompting, when combined with internal scaling, significantly extends the complexity class of solvable reasoning problems. These findings challenge prevailing assumptions about the limitations of LLMs on complex tasks, indicating that current evaluation paradigms systematically underestimate their true potential. Our work calls for a critical reassessment of how LLM reasoning is benchmarked and a more robust evaluation strategy that fully captures the true capabilities of contemporary LLMs, which can lead to a better understanding of their operational reasoning boundaries in real-world deployments.

摘要

近期研究表明,即使经过生成长推理步骤训练的大语言模型(LLMs),在应对复杂推理问题时仍面临重大挑战。然而现有文献多基于简单上下文学习示例的直接提示进行评估,这种范式很大程度上忽视了激发LLMs审慎推理的先进技术,进而过早得出LLMs性能已达天花板的结论。本文系统探索了上下文搜索与测试时扩展在超难推理任务中的协同潜力:我们发现通过采用增强内部扩展机制的先进上下文搜索提示,可在曾被判定为"无解"(如成功率低于5%)的任务上实现突破性进展。我们通过实证结果与理论分析揭示了该组合如何释放LLM推理能力:i)实证方面,在受控NP难问题与复杂现实规划基准测试中,本方法相较既往无外部机制的研究实现了高达30倍的成功率提升;ii)理论层面,我们证明上下文搜索提示结合内部扩展能显著扩展可解推理问题的复杂度类别。这些发现挑战了关于LLMs复杂任务局限性的主流假设,表明当前评估范式系统性低估了其真实潜力。本研究呼吁对LLM推理基准进行批判性重估,并提出能充分捕捉当代LLMs真实能力的鲁棒评估策略,这将有助于更准确理解其在现实部署中的实际推理边界。


Topological Structure Learning Should Be A Research Priority for LLM-Based Multi-Agent Systems

Abstract

arXiv:2505.22467v1 Announce Type: new Abstract: Large Language Model-based Multi-Agent Systems (MASs) have emerged as a powerful paradigm for tackling complex tasks through collaborative intelligence. Nevertheless, the question of how agents should be structurally organized for optimal cooperation remains largely unexplored. In this position paper, we aim to gently redirect the focus of the MAS research community toward this critical dimension: develop topology-aware MASs for specific tasks. Specifically, the system consists of three core components - agents, communication links, and communication patterns - that collectively shape its coordination performance and efficiency. To this end, we introduce a systematic, three-stage framework: agent selection, structure profiling, and topology synthesis. Each stage would trigger new research opportunities in areas such as language models, reinforcement learning, graph learning, and generative modeling; together, they could unleash the full potential of MASs in complicated real-world applications. Then, we discuss the potential challenges and opportunities in the evaluation of multiple systems. We hope our perspective and framework can offer critical new insights in the era of agentic AI.

摘要

基于大语言模型的多智能体系统(MASs)已成为通过协作智能解决复杂任务的重要范式。然而,关于如何通过结构组织实现最优协同的问题仍鲜有研究。在本立场论文中,我们旨在引导MAS研究界关注这一关键维度:为特定任务开发具有拓扑感知能力的多智能体系统。该系统由三个核心组件构成——智能体、通信链路和通信模式——它们共同决定了系统的协调性能与效率。为此,我们提出了一个系统化的三阶段框架:智能体选择、结构剖析与拓扑合成。每个阶段都将催生语言模型、强化学习、图学习和生成建模等领域的新研究机遇;这些环节的协同将充分释放多智能体系统在复杂现实应用中的潜力。随后,我们讨论了多元系统评估中潜在的挑战与机遇。希望我们的视角与框架能为智能体AI时代提供关键的新见解。


Offset Unlearning for Large Language Models

Abstract

arXiv:2404.11045v2 Announce Type: cross Abstract: Despite the strong capabilities of Large Language Models (LLMs) to acquire knowledge from their training corpora, the memorization of sensitive information in the corpora such as copyrighted, biased, and private content has led to ethical and legal concerns. In response to these challenges, unlearning has emerged as a potential remedy for LLMs affected by problematic training data. However, previous unlearning techniques are either not applicable to black-box LLMs due to required access to model internal weights, or violate data protection principles by retaining sensitive data for inference-time correction. We propose δ-Unlearning, an offset unlearning framework for black-box LLMs. Instead of tuning the black-box LLM itself, δ-Unlearning learns the logit offset needed for unlearning by contrasting the logits from a pair of smaller models. Experiments demonstrate that δ-Unlearning can effectively unlearn target data while maintaining similar or even stronger performance on general out-of-forget-scope tasks. δ-Unlearning also effectively incorporates different unlearning algorithms, making our approach a versatile solution to adapting various existing unlearning algorithms to black-box LLMs.

摘要

尽管大型语言模型(LLM)具备从训练语料库中获取知识的强大能力,但其对语料中敏感信息(如受版权保护内容、偏见性内容和隐私内容)的记忆引发了伦理与法律问题。针对这些挑战,"遗忘学习"已成为受问题训练数据影响的LLM的潜在解决方案。然而,现有遗忘技术或因需要访问模型内部权重而无法应用于黑盒LLM,或因需保留敏感数据进行推理时校正而违反数据保护原则。我们提出δ-遗忘学习,这是一种面向黑盒LLM的偏移遗忘框架。该方法不直接调整黑盒LLM本身,而是通过对比一对较小模型的逻辑输出来学习遗忘所需的逻辑偏移量。实验表明,δ-遗忘学习能有效遗忘目标数据,同时在一般非遗忘范围任务上保持相当甚至更强的性能。该框架还能有效整合不同遗忘算法,使得现有各类遗忘算法都能适配于黑盒LLM,形成通用解决方案。
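The logit-offset idea described above can be illustrated with a toy calculation. This is a minimal sketch assuming the offset is simply the difference between the logits of an unlearned small model and its untouched copy, added to the black-box model's logits; the function name `apply_offset` and all numbers are hypothetical, and the abstract does not spell out the exact combination rule.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_offset(
    black_box: list[float],
    small_unlearned: list[float],
    small_base: list[float],
) -> list[float]:
    """Shift black-box logits by the offset learned on a pair of small models.

    Assumed rule: final = black_box + (small_unlearned - small_base).
    The black-box weights are never touched; only its output logits are.
    """
    return [b + (u - s) for b, u, s in zip(black_box, small_unlearned, small_base)]


if __name__ == "__main__":
    # Toy vocabulary of three tokens; token 0 is the content to forget.
    black_box = [2.0, 1.0, 0.0]
    small_base = [1.5, 0.5, 0.0]
    small_unlearned = [-1.5, 0.5, 0.0]  # the small model was tuned to suppress token 0

    shifted = apply_offset(black_box, small_unlearned, small_base)
    print(shifted)
    print(softmax(black_box))
    print(softmax(shifted))
```

In this toy case the offset only moves the logit of the token the small model was tuned to forget, so the relative ranking of the remaining tokens is left as the black-box model produced it.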


From Large AI Models to Agentic AI: A Tutorial on Future Intelligent Communications

Abstract

arXiv:2505.22311v1 Announce Type: new Abstract: With the advent of 6G communications, intelligent communication systems face multiple challenges, including constrained perception and response capabilities, limited scalability, and low adaptability in dynamic environments. This tutorial provides a systematic introduction to the principles, design, and applications of Large Artificial Intelligence Models (LAMs) and Agentic AI technologies in intelligent communication systems, aiming to offer researchers a comprehensive overview of cutting-edge technologies and practical guidance. First, we outline the background of 6G communications, review the technological evolution from LAMs to Agentic AI, and clarify the tutorial's motivation and main contributions. Subsequently, we present a comprehensive review of the key components required for constructing LAMs. We further categorize LAMs and analyze their applicability, covering Large Language Models (LLMs), Large Vision Models (LVMs), Large Multimodal Models (LMMs), Large Reasoning Models (LRMs), and lightweight LAMs. Next, we propose a LAM-centric design paradigm tailored for communications, encompassing dataset construction and both internal and external learning approaches. Building upon this, we develop an LAM-based Agentic AI system for intelligent communications, clarifying its core components such as planners, knowledge bases, tools, and memory modules, as well as its interaction mechanisms. We also introduce a multi-agent framework with data retrieval, collaborative planning, and reflective evaluation for 6G. Subsequently, we provide a detailed overview of the applications of LAMs and Agentic AI in communication scenarios. Finally, we summarize the research challenges and future directions in current studies, aiming to support the development of efficient, secure, and sustainable next-generation intelligent communication systems.

摘要

随着6G通信时代的到来,智能通信系统面临感知响应能力受限、可扩展性不足以及动态环境适应性低下等多重挑战。本教程系统性地介绍了大型人工智能模型(LAMs)与代理人工智能(Agentic AI)技术在智能通信系统中的原理、设计与应用,旨在为研究人员提供前沿技术概览与实践指导。首先,我们概述6G通信背景,梳理从LAMs到Agentic AI的技术演进脉络,阐明本教程的动机与主要贡献;随后,对构建LAMs所需的关键组件进行全面综述,进一步将LAMs分类并分析其适用性,涵盖大语言模型(LLMs)、大视觉模型(LVMs)、大多模态模型(LMMs)、大推理模型(LRMs)及轻量化LAMs等类型;接着提出面向通信的LAM中心化设计范式,包括数据集构建及内外学习两种实现路径;在此基础上构建基于LAM的智能通信代理AI系统,阐明其规划器、知识库、工具库、记忆模块等核心组件及交互机制,并针对6G场景提出具备数据检索、协同规划与反思评估能力的多代理框架;随后详细综述LAMs与Agentic AI在通信场景中的应用案例;最后总结当前研究面临的挑战与未来方向,以支持构建高效、安全、可持续的新一代智能通信系统。


More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models

Abstract

arXiv:2505.21523v1 Announce Type: cross Abstract: Test-time compute has empowered multimodal large language models to generate extended reasoning chains, yielding strong performance on tasks such as multimodal math reasoning. However, this improved reasoning ability often comes with increased hallucination: as generations become longer, models tend to drift away from image-grounded content and rely more heavily on language priors. Attention analysis shows that longer reasoning chains lead to reduced focus on visual inputs, which contributes to hallucination. To systematically study this phenomenon, we introduce RH-AUC, a metric that quantifies how a model's perception accuracy changes with reasoning length, allowing us to evaluate whether the model preserves visual grounding during reasoning. We also release RH-Bench, a diagnostic benchmark that spans a variety of multimodal tasks, designed to assess the trade-off between reasoning ability and hallucination. Our analysis reveals that (i) larger models typically achieve a better balance between reasoning and perception, and (ii) this balance is influenced more by the types and domains of training data than by its overall volume. These findings underscore the importance of evaluation frameworks that jointly consider both reasoning quality and perceptual fidelity.

摘要

测试时计算能力的提升使多模态大语言模型能够生成更长的推理链,从而在诸如多模态数学推理等任务上表现出色。然而,这种增强的推理能力往往伴随着幻觉的增加:随着生成内容变长,模型倾向于偏离图像基础内容,更多地依赖语言先验。注意力分析表明,较长的推理链会导致对视觉输入的关注减少,从而加剧幻觉。为系统研究这一现象,我们提出了RH-AUC指标,用于量化模型感知精度随推理长度的变化,从而评估模型在推理过程中是否保持视觉基础。我们还发布了RH-Bench诊断基准,涵盖多种多模态任务,旨在评估推理能力与幻觉之间的权衡。分析表明:(i)较大模型通常在推理与感知之间取得更好的平衡;(ii)这种平衡更多受训练数据的类型和领域影响,而非总体数据量。这些发现强调了需要联合考量推理质量与感知保真度的评估框架的重要性。


How Much Do Large Language Models Know about Human Motion? A Case Study in 3D Avatar Control

Abstract

arXiv:2505.21531v1 Announce Type: cross Abstract: We explore the human motion knowledge of Large Language Models (LLMs) through 3D avatar control. Given a motion instruction, we prompt LLMs to first generate a high-level movement plan with consecutive steps (High-level Planning), then specify body part positions in each step (Low-level Planning), which we linearly interpolate into avatar animations as a clear verification lens for human evaluators. Using 20 carefully designed, representative motion instructions with full coverage of basic movement primitives and balanced body part usage, we conduct comprehensive evaluations including human assessment of both generated animations and high-level movement plans, as well as automatic comparison with oracle positions in low-level planning. We find that LLMs are strong at interpreting high-level body movements but struggle with precise body part positioning. While breaking down motion queries into atomic components improves planning performance, LLMs have difficulty with multi-step movements involving high-degree-of-freedom body parts. Furthermore, LLMs provide reasonable approximations for general spatial descriptions, but fail to handle precise spatial specifications in text and the precise spatial-temporal parameters needed for avatar control. Notably, LLMs show promise in conceptualizing creative motions and distinguishing culturally-specific motion patterns.

摘要

我们通过三维虚拟角色控制探究大语言模型(LLMs)的人类运动知识。给定运动指令时,我们引导LLMs首先生成包含连续步骤的高层次运动计划(高层次规划),随后在每一步中指定身体部位位置(低层次规划),并通过线性插值将其转化为虚拟角色动画,为人类评估者提供清晰的验证视角。通过精心设计的20个具有基本运动原语全覆盖和身体部位使用平衡的代表性运动指令,我们开展了综合评估,包括对生成的动画和高层次运动计划的人工评估,以及与低层次规划中基准位置的自动对比分析。研究发现:LLMs擅长解释高层次身体运动,但在精确定位身体部位方面存在困难;虽然将运动查询分解为原子组件能提升规划性能,但LLMs难以处理涉及高自由度身体部位的多步骤运动;此外,LLMs能对一般空间描述提供合理近似,却无法处理文本中的精确空间规范,以及虚拟角色控制所需的精确时空参数。值得注意的是,LLMs在概念化创意运动及区分文化特异性运动模式方面展现出潜力。


OpenReview Should be Protected and Leveraged as a Community Asset for Research in the Era of Large Language Models

Abstract

arXiv:2505.21537v1 Announce Type: cross Abstract: In the era of large language models (LLMs), high-quality, domain-rich, and continuously evolving datasets capturing expert-level knowledge, core human values, and reasoning are increasingly valuable. This position paper argues that OpenReview -- the continually evolving repository of research papers, peer reviews, author rebuttals, meta-reviews, and decision outcomes -- should be leveraged more broadly as a core community asset for advancing research in the era of LLMs. We highlight three promising areas in which OpenReview can uniquely contribute: enhancing the quality, scalability, and accountability of peer review processes; enabling meaningful, open-ended benchmarks rooted in genuine expert deliberation; and supporting alignment research through real-world interactions reflecting expert assessment, intentions, and scientific values. To better realize these opportunities, we suggest the community collaboratively explore standardized benchmarks and usage guidelines around OpenReview, inviting broader dialogue on responsible data use, ethical considerations, and collective stewardship.

摘要

在大语言模型(LLMs)时代,能够捕捉专家级知识、人类核心价值与推理过程的高质量、多领域且持续演化的数据集正变得愈发珍贵。本立场论文提出,OpenReview——这个持续更新的研究论文、同行评审、作者反驳、元评审及决策结果知识库——应当被更广泛地视为LLM时代推动研究的核心社区资产。我们重点阐述了OpenReview能作出独特贡献的三个领域:提升同行评审流程的质量、可扩展性与问责性;建立基于真实专家审议的开放式基准;通过反映专家评估、意图与科学价值观的真实交互支持对齐研究。为更好实现这些潜力,我们建议社区共同探索围绕OpenReview的标准化基准与使用指南,并就负责任的数据使用、伦理考量及集体管理展开更广泛对话。


Fluent but Culturally Distant: Can Regional Training Teach Cultural Understanding?

Abstract

arXiv:2505.21548v1 Announce Type: cross Abstract: Large language models (LLMs) are used around the world but exhibit Western cultural tendencies. To address this cultural misalignment, many countries have begun developing "regional" LLMs tailored to local communities. Yet it remains unclear whether these models merely speak the language of their users or also reflect their cultural values and practices. Using India as a case study, we evaluate five Indic and five global LLMs along two key dimensions: values (via the Inglehart-Welzel map and GlobalOpinionQA) and practices (via CulturalBench and NormAd). Across all four tasks, we find that Indic models do not align more closely with Indian cultural norms than global models. In fact, an average American person is a better proxy for Indian cultural values than any Indic model. Even prompting strategies fail to meaningfully improve alignment. Ablations show that regional fine-tuning does not enhance cultural competence and may in fact hurt it by impeding recall of existing knowledge. We trace this failure to the scarcity of high-quality, untranslated, and culturally grounded pretraining and fine-tuning data. Our study positions cultural evaluation as a first-class requirement alongside multilingual benchmarks and offers a reusable methodology for developers. We call for deeper investments in culturally representative data to build and evaluate truly sovereign LLMs.

摘要

大型语言模型(LLMs)在全球范围内得到广泛应用,但呈现出西方文化倾向。为解决这种文化错位问题,许多国家已开始开发针对本地社区的"区域性"LLMs。然而,这些模型究竟仅能使用用户语言,还是同时反映了其文化价值观与实践,目前尚不明确。以印度为案例,我们沿两个关键维度评估了五个印度本土模型和五个全球模型:价值观(通过Inglehart-Welzel地图和GlobalOpinionQA)与实践(通过CulturalBench和NormAd)。在所有四项任务中,我们发现印度本土模型并未比全球模型更符合印度文化规范。事实上,普通美国人在代表印度文化价值观方面优于任何印度本土模型。即使采用提示策略也未能显著改善文化对齐性。消融实验表明,区域性微调不仅无法提升文化适应能力,反而可能因阻碍现有知识回忆而削弱该能力。我们将此问题归因于缺乏高质量、未翻译且文化根植的预训练与微调数据。本研究将文化评估定位为与多语言基准同等重要的核心要求,并为开发者提供了可复用的方法论。我们呼吁加大对文化代表性数据的投入,以构建和评估真正具有主权性的LLMs。


Image Tokens Matter: Mitigating Hallucination in Discrete Tokenizer-based Large Vision-Language Models via Latent Editing

Abstract

arXiv:2505.21547v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) with discrete image tokenizers unify multimodal representations by encoding visual inputs into a finite set of tokens. Despite their effectiveness, we find that these models still hallucinate non-existent objects. We hypothesize that this may be due to visual priors induced during training: When certain image tokens frequently co-occur in the same spatial regions and represent shared objects, they become strongly associated with the verbalizations of those objects. As a result, the model may hallucinate by evoking visually absent tokens that often co-occur with present ones. To test this assumption, we construct a co-occurrence graph of image tokens using a segmentation dataset and employ a Graph Neural Network (GNN) with contrastive learning followed by a clustering method to group tokens that frequently co-occur in similar visual contexts. We find that hallucinations predominantly correspond to clusters whose tokens dominate the input, and more specifically, that the visually absent tokens in those clusters show much higher correlation with hallucinated objects compared to tokens present in the image. Based on this observation, we propose a hallucination mitigation method that suppresses the influence of visually absent tokens by modifying latent image embeddings during generation. Experiments show our method reduces hallucinations while preserving expressivity. Code is available at https://github.com/weixingW/CGC-VTD/tree/main

摘要

采用离散图像标记器的大型视觉语言模型(LVLMs)通过将视觉输入编码为有限标记集来实现多模态表征的统一。尽管这些模型表现优异,我们发现其仍会幻觉出不存在物体。我们假设这可能是训练过程中诱导的视觉先验所致:当某些图像标记在相同空间区域频繁共现并表征相同物体时,它们会与该物体对应的语言描述形成强关联。因此,模型可能通过激活与现存标记频繁共现的视觉缺失标记而产生幻觉。为验证该假设,我们利用分割数据集构建图像标记共现图,采用图神经网络(GNN)进行对比学习后通过聚类方法,将相似视觉语境中频繁共现的标记分组。研究发现幻觉主要对应于输入中占主导地位的标记簇,且相较于图像中存在的标记,这些簇中视觉缺失标记与幻觉物体的相关性显著更高。基于此发现,我们提出一种通过在生成过程中修改潜在图像嵌入来抑制视觉缺失标记影响的幻觉缓解方法。实验表明该方法能在保持表达力的同时减少幻觉。代码详见https://github.com/weixingW/CGC-VTD/tree/main
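The first step of the analysis above, building a co-occurrence graph of image tokens from segmented regions, can be sketched as follows. This is an illustrative sketch only: it assumes each segmentation region is available as a list of discrete token ids, counts an edge once per region in which two distinct tokens appear together, and uses the hypothetical name `cooccurrence_graph`. The GNN, contrastive learning, and clustering stages are not shown.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(regions: list[list[int]]) -> Counter:
    """Count how often pairs of image tokens share a segmented region.

    Each region is a list of token ids (assumed input format).
    Returns a Counter mapping (token_a, token_b) with token_a < token_b
    to the number of regions containing both tokens.
    """
    edges: Counter = Counter()
    for tokens in regions:
        # Deduplicate so a token repeated inside one region counts once.
        for a, b in combinations(sorted(set(tokens)), 2):
            edges[(a, b)] += 1
    return edges


if __name__ == "__main__":
    regions = [[3, 7, 7, 9], [3, 9], [7]]
    graph = cooccurrence_graph(regions)
    for (a, b), weight in sorted(graph.items()):
        print(a, b, weight)
```

Edge weights of this kind could then serve as the adjacency input for a graph model, with heavier edges marking token pairs that tend to depict the same object.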


ChemHAS: Hierarchical Agent Stacking for Enhancing Chemistry Tools

Abstract

arXiv:2505.21569v1 Announce Type: cross Abstract: Large Language Model (LLM)-based agents have demonstrated the ability to improve performance in chemistry-related tasks by selecting appropriate tools. However, their effectiveness remains limited by the inherent prediction errors of chemistry tools. In this paper, we take a step further by exploring how LLM-based agents can, in turn, be leveraged to reduce prediction errors of the tools. To this end, we propose ChemHAS (Chemical Hierarchical Agent Stacking), a simple yet effective method that enhances chemistry tools through optimizing agent-stacking structures from limited data. ChemHAS achieves state-of-the-art performance across four fundamental chemistry tasks, demonstrating that our method can effectively compensate for prediction errors of the tools. Furthermore, we identify and characterize four distinct agent-stacking behaviors, potentially improving interpretability and revealing new possibilities for AI agent applications in scientific research. Our code and dataset are publicly available at https://anonymous.4open.science/r/ChemHAS-01E4/README.md.

摘要

基于大语言模型(LLM)的智能体已展现出通过选择合适工具来提升化学相关任务性能的能力。然而,化学工具固有的预测误差仍限制着其有效性。本文进一步探索如何利用LLM智能体来降低工具的预测误差,为此提出ChemHAS(化学分层智能体堆叠)方法——一种通过有限数据优化智能体堆叠结构来增强化学工具的简洁有效方案。ChemHAS在四项基础化学任务中实现了最先进的性能,证明该方法能有效补偿工具的预测误差。此外,我们识别并表征了四种不同的智能体堆叠行为,这有望提升可解释性,并为科学研究中AI智能体应用揭示新的可能性。代码与数据集已公开于https://anonymous.4open.science/r/ChemHAS-01E4/README.md。


AITEE -- Agentic Tutor for Electrical Engineering

Abstract

arXiv:2505.21582v1 Announce Type: cross Abstract: Intelligent tutoring systems combined with large language models offer a promising approach to address students' diverse needs and promote self-efficacious learning. While large language models possess good foundational knowledge of electrical engineering basics, they remain insufficiently capable of addressing specific questions about electrical circuits. In this paper, we present AITEE, an agent-based tutoring system for electrical engineering designed to accompany students throughout their learning process, offer individualized support, and promote self-directed learning. AITEE supports both hand-drawn and digital circuits through an adapted circuit reconstruction process, enabling natural interaction with students. Our novel graph-based similarity measure identifies relevant context from lecture materials through a retrieval augmented generation approach, while parallel Spice simulation further enhances accuracy in applying solution methodologies. The system implements a Socratic dialogue to foster learner autonomy through guided questioning. Experimental evaluations demonstrate that AITEE significantly outperforms baseline approaches in domain-specific knowledge application, with even medium-sized LLM models showing acceptable performance. Our results highlight the potential of agentic tutors to deliver scalable, personalized, and effective learning environments for electrical engineering education.

摘要

智能辅导系统与大型语言模型相结合,为解决学生多样化需求和促进自我效能学习提供了可行方案。尽管大型语言模型具备电气工程基础知识的良好储备,但在处理电路相关具体问题时仍存在不足。本文提出AITEE——一个基于智能体的电气工程辅导系统,旨在全程陪伴学生学习过程,提供个性化支持并促进自主式学习。该系统通过改进的电路重建流程同时支持手绘与数字电路,实现与学生的自然交互。我们提出的新型图结构相似度度量方法,结合检索增强生成技术从讲义材料中识别相关上下文,而并行Spice仿真则进一步提升解决方案方法的应用准确性。系统采用苏格拉底式对话机制,通过引导式提问培养学习者自主性。实验评估表明,AITEE在领域知识应用方面显著优于基线方法,即使中等规模的语言模型也展现出可接受的性能。研究结果凸显了智能体辅导系统在电气工程教育中构建可扩展、个性化且高效学习环境的潜力。


RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving

Abstract

arXiv:2505.21577v1 Announce Type: cross Abstract: The ultimate goal of code agents is to solve complex tasks autonomously. Although large language models (LLMs) have made substantial progress in code generation, real-world tasks typically demand full-fledged code repositories rather than simple scripts. Building such repositories from scratch remains a major challenge. Fortunately, GitHub hosts a vast, evolving collection of open-source repositories, which developers frequently reuse as modular components for complex tasks. Yet, existing frameworks like OpenHands and SWE-Agent still struggle to effectively leverage these valuable resources. Relying solely on README files provides insufficient guidance, and deeper exploration reveals two core obstacles: overwhelming information and tangled dependencies of repositories, both constrained by the limited context windows of current LLMs. To tackle these issues, we propose RepoMaster, an autonomous agent framework designed to explore and reuse GitHub repositories for solving complex tasks. For efficient understanding, RepoMaster constructs function-call graphs, module-dependency graphs, and hierarchical code trees to identify essential components, providing only identified core elements to the LLMs rather than the entire repository. During autonomous execution, it progressively explores related components using our exploration tools and prunes information to optimize context usage. Evaluated on the adjusted MLE-bench, RepoMaster achieves a 110% relative boost in valid submissions over the strongest baseline OpenHands. On our newly released GitTaskBench, RepoMaster lifts the task-pass rate from 24.1% to 62.9% while reducing token usage by 95%. Our code and demonstration materials are publicly available at https://github.com/wanghuacan/RepoMaster.

摘要

代码智能体的终极目标是自主解决复杂任务。尽管大语言模型在代码生成方面取得显著进展,但现实任务通常需要完整的代码仓库而非简单脚本。从零开始构建此类仓库仍面临重大挑战。幸运的是,GitHub托管着庞大且持续演进的开源仓库集合,开发者常将其作为模块化组件复用于复杂任务。然而,现有框架如OpenHands和SWE-Agent仍难以有效利用这些宝贵资源:仅依赖README文件提供的指导不足,深入分析后我们发现两大核心障碍——仓库信息过载与依赖关系错综复杂,二者均受限于当前大语言模型的有限上下文窗口。为解决这些问题,我们提出RepoMaster——一个专为探索和复用GitHub仓库以解决复杂任务而设计的自主智能体框架。该框架通过构建函数调用图、模块依赖图及分层代码树来识别核心组件,仅向大语言模型提供已识别的关键元素而非整个仓库。在自主执行过程中,它利用我们的探索工具逐步关联相关组件,并通过信息剪枝优化上下文使用。在调整后的MLE-bench评估中,RepoMaster相较最强基线OpenHands实现有效提交量110%的相对提升。在我们新发布的GitTaskBench上,RepoMaster将任务通过率从24.1%提升至62.9%,同时减少95%的token消耗。代码及演示材料已公开于https://github.com/wanghuacan/RepoMaster。


Public Discourse Sandbox: Facilitating Human and AI Digital Communication Research

Abstract

arXiv:2505.21604v1 Announce Type: cross Abstract: Social media serves as a primary communication and information dissemination platform for major global events, entertainment, and niche or topically focused community discussions. Therefore, it represents a valuable resource for researchers who aim to understand numerous questions. However, obtaining data can be difficult, expensive, and often unreliable due to the presence of bots, fake accounts, and manipulated content. Additionally, there are ethical concerns if researchers decide to conduct an online experiment without explicitly notifying social media users about their intent. There is a need for more controlled and scalable mechanisms to evaluate the impacts of digital discussion interventions on audiences. We introduce the Public Discourse Sandbox (PDS), which serves as a digital discourse research platform for human-AI as well as AI-AI discourse research, testing, and training. PDS provides a safe and secure space for research experiments that are not viable on public, commercial social media platforms. Its main purpose is to enable the understanding of AI behaviors and the impacts of customized AI participants via techniques such as prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. We provide a hosted live version of the sandbox to support researchers as well as the open-sourced code on GitHub for community collaboration and contribution.

摘要

社交媒体作为全球重大事件、娱乐活动以及小众或主题性社群讨论的主要传播与信息发布平台,为研究者提供了理解诸多问题的宝贵资源。然而,由于机器人账号、虚假账户和操纵性内容的存在,数据获取往往面临困难、成本高昂且可靠性不足的问题。此外,若研究者在未明确告知社交媒体用户的情况下开展在线实验,还会引发伦理争议。当前亟需建立更具可控性和扩展性的机制,以评估数字讨论干预对受众的影响。为此,我们推出"公共话语沙盒"(PDS)——一个面向人机对话及人工智能间对话研究、测试与训练的数字话语研究平台。该沙盒为无法在公共商业社交媒体平台上实施的研究实验提供了安全可靠的环境,其核心目标是通过提示工程、检索增强生成(RAG)和微调等技术,助力研究者理解AI行为模式及定制化AI参与者的影响。我们不仅提供托管式沙盒实时版本支持科研工作,同时也在GitHub开源代码以促进社区协作与贡献。


Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits

Abstract

arXiv:2505.21594v1 Announce Type: cross Abstract: Large Language Models (LLMs) enable various applications on edge devices such as smartphones, wearables, and embodied robots. However, their deployment often depends on expensive cloud-based APIs, creating high operational costs, which limit access for smaller organizations and raise sustainability concerns. Certain LLMs can be deployed on-device, offering a cost-effective solution with reduced latency and improved privacy. Yet, limited computing resources constrain the size and accuracy of models that can be deployed, necessitating a collaborative design between edge and cloud. We propose a fast and cost-effective speculative edge-cloud decoding framework with a large target model on the server and a small draft model on the device. By introducing early exits in the target model, tokens are generated mid-verification, allowing the client to preemptively draft subsequent tokens before final verification, thus utilizing idle time and enhancing parallelism between edge and cloud. Using an NVIDIA Jetson Nano (client) and an A100 GPU (server) with Vicuna-68M (draft) and Llama2-7B (target) models, our method achieves up to a 35% reduction in latency compared to cloud-based autoregressive decoding, with an additional 11% improvement from preemptive drafting. To demonstrate real-world applicability, we deploy our method on the Unitree Go2 quadruped robot using Vision-Language Model (VLM) based control, achieving a 21% speedup over traditional cloud-based autoregressive decoding. These results demonstrate the potential of our framework for real-time LLM and VLM applications on resource-constrained edge devices.

摘要

大型语言模型(LLMs)为智能手机、可穿戴设备和具身机器人等边缘设备提供了多样化的应用可能。然而,其部署通常依赖昂贵的云端API接口,导致高昂运营成本,这不仅限制了小型组织的使用权限,也引发了可持续性担忧。部分LLMs可采用设备端部署方案,通过降低延迟和增强隐私保护实现经济高效的解决方案。但有限的计算资源制约了可部署模型的规模与精度,需要边缘与云端协同设计。我们提出一种快速且低成本的边缘-云端推测式解码框架,在服务器端部署大型目标模型,在设备端运行小型草稿模型。通过在目标模型中引入早期退出机制,令牌可在验证过程中生成,使得客户端能在最终验证前预起草后续令牌,从而利用空闲时间并增强边缘与云端的并行性。基于NVIDIA Jetson Nano(客户端)和A100 GPU(服务器)平台,配合Vicuna-68M(草稿)与Llama2-7B(目标)模型,我们的方法相比云端自回归解码最高可降低35%的延迟,其中预起草机制额外贡献了11%的改进。为验证实际应用价值,我们将该方法部署于Unitree Go2四足机器人,采用基于视觉语言模型(VLM)的控制方案,较传统云端自回归解码实现了21%的加速。这些结果证明了本框架在资源受限边缘设备上实现实时LLM和VLM应用的潜力。
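The draft-and-verify loop that speculative decoding builds on can be sketched in a few lines. This is a minimal greedy version with toy stand-in models, not the paper's system: it omits the early exits and preemptive drafting described in the abstract, and the names `speculative_decode`, `greedy_decode`, `toy_draft`, and `toy_target` are hypothetical.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # maps a context to the next token


def greedy_decode(model: NextToken, prompt: List[Token], max_new: int) -> List[Token]:
    """Plain autoregressive decoding, used here as the reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(draft: NextToken, target: NextToken,
                       prompt: List[Token], max_new: int, k: int = 4) -> List[Token]:
    """Greedy draft-and-verify decoding; output matches target-only decoding."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The small draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, limit - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target model checks each proposed position in turn
        #    (a real system would score all positions in one batched pass).
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # First mismatch: keep the target's token, discard the rest.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out


# Toy models (assumptions for illustration): the target counts modulo 10,
# and the draft agrees except right after a token ending in 4 or 9.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] % 5 == 4 else (ctx[-1] + 1) % 10
```

Because every accepted token equals what the target would have produced, the result is identical to target-only greedy decoding; the saving comes from verifying several drafted tokens per target call.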


The Feasibility of Topic-Based Watermarking on Academic Peer Reviews

Abstract

arXiv:2505.21636v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly integrated into academic workflows, with many conferences and journals permitting their use for tasks such as language refinement and literature summarization. However, their use in peer review remains prohibited due to concerns around confidentiality breaches, hallucinated content, and inconsistent evaluations. As LLM-generated text becomes more indistinguishable from human writing, there is a growing need for reliable attribution mechanisms to preserve the integrity of the review process. In this work, we evaluate topic-based watermarking (TBW), a lightweight, semantic-aware technique designed to embed detectable signals into LLM-generated text. We conduct a comprehensive assessment across multiple LLM configurations, including base, few-shot, and fine-tuned variants, using authentic peer review data from academic conferences. Our results show that TBW maintains review quality relative to non-watermarked outputs, while demonstrating strong robustness to paraphrasing-based evasion. These findings highlight the viability of TBW as a minimally intrusive and practical solution for enforcing LLM usage policies in peer review.

摘要

大型语言模型(LLMs)正日益融入学术工作流程,许多会议和期刊允许将其用于语言润色和文献综述等任务。然而,由于担心泄露机密信息、生成虚构内容及评价不一致等问题,同行评审中仍禁止使用LLMs。随着LLM生成文本与人类写作的区分度逐渐降低,建立可靠的溯源机制以维护评审过程的完整性变得愈发重要。本研究评估了基于主题的水印技术(TBW)——一种轻量级、语义感知的方法,旨在向LLM生成文本中嵌入可检测信号。我们使用学术会议的真实同行评审数据,对多种LLM配置(包括基础模型、少样本学习及微调变体)进行了全面评估。结果表明,相较于无水印输出,TBW在保持评审质量的同时,对基于改写的规避行为表现出极强的鲁棒性。这些发现证明TBW可作为执行同行评审中LLM使用规范的一种低干扰、实用性解决方案。


SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge

Abstract

arXiv:2505.21605v1 Announce Type: cross Abstract: Large language models (LLMs) exhibit advancing capabilities in complex tasks, such as reasoning and graduate-level question answering, yet their resilience against misuse, particularly involving scientifically sophisticated risks, remains underexplored. Existing safety benchmarks typically focus either on instructions requiring minimal knowledge comprehension (e.g., "tell me how to build a bomb") or utilize prompts that are relatively low-risk (e.g., multiple-choice or classification tasks about hazardous content). Consequently, they fail to adequately assess model safety when handling knowledge-intensive, hazardous scenarios. To address this critical gap, we introduce SOSBench, a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas). We evaluate frontier models within a unified evaluation framework using our SOSBench. Despite their alignment claims, advanced models consistently disclose policy-violating content across all domains, demonstrating alarmingly high rates of harmful responses (e.g., 79.1% for Deepseek-R1 and 47.3% for GPT-4.1). These results highlight significant safety alignment deficiencies and underscore urgent concerns regarding the responsible deployment of powerful LLMs.

摘要

大型语言模型(LLMs)在复杂任务(如推理与研究生水平问答)中展现出日益增强的能力,但其抗滥用韧性——尤其是在涉及科学复杂性风险的情境下——仍未得到充分探究。现有安全基准通常聚焦于仅需基础知识理解的指令(例如"告诉我如何制作炸弹"),或采用风险相对较低的提示(如关于危险内容的多选题或分类任务),因而无法充分评估模型在知识密集型危险场景中的安全性。为填补这一关键空白,我们提出SOSBench——一个基于法规、聚焦高危领域的基准测试,涵盖化学、生物学、医学、药理学、物理学和心理学六大高风险科学领域。该基准包含3,000条源自真实法规条例的提示,通过LLM辅助的进化管道系统扩展,引入多样化且现实化的滥用场景(例如涉及高级化学公式的详细爆炸物合成指导)。我们在统一评估框架下使用SOSBench对前沿模型进行测试。尽管这些模型宣称已进行安全对齐,先进模型在所有领域均持续披露违反政策的内容,有害响应率居高不下(如Deepseek-R1达79.1%,GPT-4.1达47.3%)。这些结果揭示了显著的安全对齐缺陷,并突显了关于强大LLMs负责任部署的紧迫性问题。


R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing

Abstract

arXiv:2505.21600v1 Announce Type: cross Abstract: Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing substantial deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers as they fail to follow LLMs' reasoning paths. Luckily, we reveal that only a small fraction of tokens genuinely diverge reasoning paths between LLMs and SLMs. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce Roads to Rome (R2R), a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.

摘要

大型语言模型(LLMs)以显著增加的推理开销为代价获得了卓越的推理能力,这给实际部署带来了巨大挑战。尽管经过蒸馏的小型语言模型(SLMs)显著提升了效率,但其性能因无法遵循LLMs的推理路径而受限。幸运的是,我们发现仅有少量关键标记会真正导致LLMs与SLMs的推理路径分叉,大多数生成标记要么完全相同,要么仅存在中性差异(如缩写或表达方式的细微变化)。基于这一发现,我们提出R2R(Roads to Rome)——一种神经标记路由方法,该方法仅针对关键路径分叉标记选择性调用LLMs,而将大部分标记生成任务交由SLM处理。我们还开发了自动化数据生成流程,用于识别分叉标记并生成标记级路由标签以训练轻量级路由器。将R2R应用于DeepSeek家族的R1-1.5B和R1-32B模型组合后,在数学、编程和问答等挑战性基准测试中,平均激活参数量仅5.6B的R2R以1.6倍优势超越R1-7B的平均准确率,甚至优于R1-14B模型。与R1-32B相比,在保持相当性能的同时实现了2.8倍的实时加速,推进了测试时缩放效率的帕累托前沿。代码已开源:https://github.com/thu-nics/R2R。
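The token-level routing idea in the R2R abstract can be illustrated with a short decoding loop. This is only a sketch under assumptions: `routed_decode`, `slm`, `llm`, and `needs_llm` are hypothetical names, the models are toy callables, and the router here is a hand-written predicate, whereas the paper trains a lightweight neural router on automatically generated token-level labels.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[str]], str]
Router = Callable[[List[str], str], bool]


def routed_decode(slm: NextToken, llm: NextToken, needs_llm: Router,
                  prompt: List[str], max_new: int) -> Tuple[List[str], int]:
    """Generate with the small model, deferring to the large one only when flagged."""
    out = list(prompt)
    llm_calls = 0
    for _ in range(max_new):
        candidate = slm(out)  # cheap proposal from the small model
        if needs_llm(out, candidate):
            # The router predicts this token would send reasoning down a
            # different path, so the large model decides it instead.
            candidate = llm(out)
            llm_calls += 1
        out.append(candidate)
    return out, llm_calls


# Toy stand-ins (assumptions for illustration only).
def toy_slm(ctx: List[str]) -> str:
    return "a"


def toy_llm(ctx: List[str]) -> str:
    return "B"


def toy_router(ctx: List[str], candidate: str) -> bool:
    return len(ctx) % 3 == 0  # pretend every third position is path-divergent
```

With these toys, `routed_decode(toy_slm, toy_llm, toy_router, ["x"], 5)` consults the large model once in five steps, which is the kind of ratio that keeps the average activated parameter count low.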


Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives

Abstract

arXiv:2505.21627v1 Announce Type: cross Abstract: State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it -- they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we introduce an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, to completely eliminate the financial incentive to strategize, we introduce a simple incentive-compatible token pricing mechanism. Under this mechanism, the price users pay for an output provided by a model depends on the number of characters of the output -- they pay a fixed price per character. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the Llama, Gemma and Ministral families, and input prompts from the LMSYS Chatbot Arena platform.

摘要

当前最先进的大语言模型需要专用硬件和大量能源才能运行。因此,提供大语言模型访问权限的云服务变得非常流行。在这些服务中,用户为模型输出支付的价格取决于模型生成输出时使用的令牌数量,即按每个令牌支付固定价格。本研究表明,这种定价机制为服务提供商创造了策略性误报模型生成输出所用令牌(数量)的财务动机,而用户无法证明甚至无从知晓提供商是否存在超额收费行为。然而我们也发现,若要求不诚信的提供商必须公开模型的生成过程,则要在不引起怀疑的前提下实现最优误报具有相当难度。尽管如此,作为概念验证,我们提出了一种高效的启发式算法,使提供商能在不引发怀疑的情况下大幅超额收费,这揭示了现行按令牌计费机制下用户的脆弱性。为进一步彻底消除策略性行为的财务动机,我们提出了一种简单的激励相容令牌定价机制。在该机制下,用户为模型输出支付的价格取决于输出内容的字符数量,即按每个字符支付固定价格。为验证和补充理论结果,我们使用Llama、Gemma和Ministral系列的多个大语言模型,以及来自LMSYS Chatbot Arena平台的输入提示进行了实验。
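A small arithmetic sketch shows why per-character pricing removes the incentive described in the abstract: the same output string can be reported under different tokenizations, which changes a per-token bill but not a per-character one. The function names, the example tokenizations, and the integer prices are assumptions for illustration, not values from the paper.

```python
from typing import List


def bill_per_token(tokens: List[str], price_per_token: int) -> int:
    """Bill proportional to the number of tokens the provider reports."""
    return len(tokens) * price_per_token


def bill_per_character(tokens: List[str], price_per_char: int) -> int:
    """Bill proportional to the length of the text the user actually receives."""
    return len("".join(tokens)) * price_per_char


# Two reported tokenizations of the same visible output "tokenization".
faithful = ["token", "ization"]
inflated = ["tok", "en", "iz", "ation"]
```

Here the inflated report doubles the per-token bill while the per-character bill is unchanged, because the user can verify the character count directly from the output they received.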


Incentivizing Permissionless Distributed Learning of LLMs

Abstract

arXiv:2505.21684v1 Announce Type: cross Abstract: We describe an incentive system for distributed deep learning of foundational models where peers are rewarded for contributions. The incentive system, Gauntlet, has been deployed on the Bittensor blockchain and used to train a 1.2B LLM with completely permissionless contributions of pseudo-gradients: no control over the users that can register or their hardware. Gauntlet can be applied to any synchronous distributed training scheme that relies on aggregating updates or pseudo-gradients. We rely on a two-stage mechanism for fast filtering of peer uptime, reliability, and synchronization, combined with the core component that estimates the loss before and after individual pseudo-gradient contributions. We utilized an OpenSkill rating system to track competitiveness of pseudo-gradient scores across time. Finally, we introduce a novel mechanism to ensure peers on the network perform unique computations. Our live 1.2B run, which has paid out real-valued tokens to participants based on the value of their contributions, yielded a competitive (on a per-iteration basis) 1.2B model that demonstrates the utility of our incentive system.

摘要

我们提出了一种用于基础模型分布式深度学习的激励机制,该机制通过奖励参与者的贡献来运作。这一名为Gauntlet的激励系统已部署在Bittensor区块链上,并成功用于训练一个12亿参数的大型语言模型(LLM),其特点在于完全无需许可地接收伪梯度贡献:既不控制注册用户资格,也不限制其硬件条件。Gauntlet可应用于任何依赖聚合更新或伪梯度的同步分布式训练方案。我们采用两阶段机制快速筛选节点的在线率、可靠性和同步性,其核心组件通过对比个体伪梯度贡献前后的损失值进行评估。系统采用OpenSkill评分体系持续追踪伪梯度得分的动态竞争力。最后,我们引入创新机制确保网络节点执行独特计算任务。在实际运行的12亿参数模型训练中,系统根据参与者贡献价值发放实际代币奖励,最终产出的模型在单次迭代性能上表现出竞争力,验证了该激励系统的实用价值。


Rethinking the Outlier Distribution in Large Language Models: An In-depth Study

Abstract

arXiv:2505.21670v1 Announce Type: cross Abstract: Investigating outliers in large language models (LLMs) is crucial due to their significant impact on various aspects of LLM performance, including quantization and compression. Outliers often cause considerable quantization errors, leading to degraded model performance. Identifying and addressing these outliers can enhance the accuracy and efficiency of the quantization process, enabling smoother deployment on edge devices or specialized hardware. Recent studies have identified two common types of outliers in LLMs: massive activations and channel-wise outliers. While numerous quantization algorithms have been proposed to mitigate their effects and maintain satisfactory accuracy, few have thoroughly explored the root causes of these outliers in depth. In this paper, we conduct a comprehensive investigation into the formation mechanisms of these outliers and propose potential strategies to mitigate their occurrence. Ultimately, we introduce some efficient approaches to eliminate most massive activations and channel-wise outliers with minimal impact on accuracy.

摘要

研究大型语言模型(LLMs)中的异常值至关重要,因为这些异常值对模型性能的多个方面(包括量化和压缩)具有显著影响。异常值通常会导致较大的量化误差,从而降低模型性能。识别并解决这些异常值可以提高量化过程的准确性和效率,实现在边缘设备或专用硬件上的更顺畅部署。近期研究发现了LLMs中两种常见的异常值类型:大规模激活异常和通道级异常。尽管已有大量量化算法被提出以减轻其影响并保持满意的准确度,但很少有研究深入探讨这些异常值的根本成因。本文对这些异常值的形成机制进行了全面研究,并提出了可能减少其发生的策略。最终,我们介绍了一些高效方法,可在对准确度影响最小的情况下消除大多数大规模激活异常和通道级异常。


How does Misinformation Affect Large Language Model Behaviors and Preferences?

Abstract

arXiv:2505.21608v1 Announce Type: cross Abstract: Large Language Models (LLMs) have shown remarkable capabilities in knowledge-intensive tasks, while they remain vulnerable when encountering misinformation. Existing studies have explored the role of LLMs in combating misinformation, but there is still a lack of fine-grained analysis on the specific aspects and extent to which LLMs are influenced by misinformation. To bridge this gap, we present MisBench, the current largest and most comprehensive benchmark for evaluating LLMs' behavior and knowledge preference toward misinformation. MisBench consists of 10,346,712 pieces of misinformation, which uniquely considers both knowledge-based conflicts and stylistic variations in misinformation. Empirical results reveal that while LLMs demonstrate comparable abilities in discerning misinformation, they still remain susceptible to knowledge conflicts and stylistic variations. Based on these findings, we further propose a novel approach called Reconstruct to Discriminate (RtD) to strengthen LLMs' ability to detect misinformation. Our study provides valuable insights into LLMs' interactions with misinformation, and we believe MisBench can serve as an effective benchmark for evaluating LLM-based detectors and enhancing their reliability in real-world applications. Codes and data are available at https://github.com/GKNL/MisBench.

摘要

大型语言模型(LLMs)在知识密集型任务中展现出卓越能力,但在遭遇错误信息时仍显脆弱。现有研究探讨了LLMs在应对错误信息中的作用,但对其受错误信息影响的具体方面和程度仍缺乏细粒度分析。为填补这一空白,我们提出MisBench——当前规模最大、最全面的基准测试,用于评估LLMs对错误信息的行为反应与知识偏好。该基准包含10,346,712条错误信息,创新性地同时考量了知识冲突与错误信息风格变异两个维度。实验结果表明:尽管LLMs表现出相当的误信息识别能力,其仍易受知识冲突和风格变异的影响。基于这些发现,我们进一步提出"重构判别法"(Reconstruct to Discriminate, RtD)以增强LLMs的误信息检测能力。本研究为理解LLMs与错误信息的交互机制提供了重要见解,相信MisBench可作为评估基于LLM的检测器、提升其实际应用可靠性的有效基准。代码与数据详见https://github.com/GKNL/MisBench。


LLMPR: A Novel LLM-Driven Transfer Learning based Petition Ranking Model

Abstract

arXiv:2505.21689v1 Announce Type: cross Abstract: The persistent accumulation of unresolved legal cases, especially within the Indian judiciary, significantly hampers the timely delivery of justice. Manual methods of prioritizing petitions are often prone to inefficiencies and subjective biases, further exacerbating delays. To address this issue, we propose LLMPR (Large Language Model-based Petition Ranking), an automated framework that utilizes transfer learning and machine learning to assign priority rankings to legal petitions based on their contextual urgency. Leveraging the ILDC dataset comprising 7,593 annotated petitions, we process unstructured legal text and extract features through various embedding techniques, including DistilBERT, LegalBERT, and MiniLM. These textual embeddings are combined with quantitative indicators such as gap days, rank scores, and word counts to train multiple machine learning models, including Random Forest, Decision Tree, XGBoost, LightGBM, and CatBoost. Our experiments demonstrate that Random Forest and Decision Tree models yield superior performance, with accuracy exceeding 99% and a Spearman rank correlation of 0.99. Notably, models using only numerical features achieve nearly optimal ranking results (R² = 0.988, ρ = 0.998), while LLM-based embeddings offer only marginal gains. These findings suggest that automated petition ranking can effectively streamline judicial workflows, reduce case backlog, and improve fairness in legal prioritization.

摘要

未决法律案件的持续积压,特别是在印度司法系统中,严重阻碍了司法的及时执行。传统的人工请愿书优先级排序方法往往效率低下且易受主观偏见影响,进一步加剧了案件延误。为解决这一问题,我们提出LLMPR(基于大语言模型的请愿书排序框架),该自动化框架利用迁移学习和机器学习技术,根据法律请愿书的上下文紧急程度分配优先级排序。通过包含7,593份标注请愿书的ILDC数据集,我们处理非结构化法律文本,并采用DistilBERT、LegalBERT和MiniLM等多种嵌入技术提取特征。这些文本嵌入特征与间隔天数、等级分数和字数等量化指标相结合,用于训练包括随机森林、决策树、XGBoost、LightGBM和CatBoost在内的多种机器学习模型。实验结果表明,随机森林和决策树模型表现最优,准确率超过99%,斯皮尔曼等级相关系数达0.99。值得注意的是,仅使用数值特征的模型即可实现近乎最优的排序效果(R² = 0.988,ρ = 0.998),而基于大语言模型的嵌入仅带来边际提升。这些发现表明,自动化请愿书排序能有效优化司法工作流程,减少案件积压,并提升法律优先级排序的公平性。
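The Spearman rank correlation ρ reported in the abstract measures how well a predicted ordering agrees with a reference ordering. As a minimal sketch, assuming no tied values, it can be computed from rank differences with the textbook formula ρ = 1 − 6Σd² / (n(n² − 1)). The function names and the example scores are hypothetical and are not taken from the ILDC data.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """1-based ranks of the values, assuming all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation for two equal-length sequences without ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical urgency scores for five petitions: reference vs. predicted.
reference = [0.9, 0.7, 0.5, 0.3, 0.1]
predicted = [0.7, 0.9, 0.3, 0.5, 0.1]
```

For data with ties, a library routine that assigns average ranks is the safer choice; the closed-form expression above is exact only when all ranks are distinct.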


Privacy-Preserving Chest X-ray Report Generation via Multimodal Federated Learning with ViT and GPT-2

Abstract

arXiv:2505.21715v1 Announce Type: cross Abstract: The automated generation of radiology reports from chest X-ray images holds significant promise in enhancing diagnostic workflows while preserving patient privacy. Traditional centralized approaches often require sensitive data transfer, posing privacy concerns. To address this, the study proposes a Multimodal Federated Learning framework for chest X-ray report generation using the IU-Xray dataset. The system utilizes a Vision Transformer (ViT) as the encoder and GPT-2 as the report generator, enabling decentralized training without sharing raw data. Three Federated Learning (FL) aggregation strategies: FedAvg, Krum Aggregation and a novel Loss-aware Federated Averaging (L-FedAvg) were evaluated. Among these, Krum Aggregation demonstrated superior performance across lexical and semantic evaluation metrics such as ROUGE, BLEU, BERTScore and RaTEScore. The results show that FL can match or surpass centralized models in generating clinically relevant and semantically rich radiology reports. This lightweight and privacy-preserving framework paves the way for collaborative medical AI development without compromising data confidentiality.

摘要

基于胸部X光图像的自动化放射学报告生成在提升诊断工作流程效率的同时,能够有效保护患者隐私。传统集中式方法常需传输敏感数据,存在隐私泄露风险。为此,本研究提出一种基于IU-Xray数据集的多模态联邦学习框架,用于胸部X光报告生成。该系统采用视觉变换器(ViT)作为编码器,GPT-2作为报告生成器,实现无需共享原始数据的分布式训练。评估了三种联邦学习聚合策略:联邦平均(FedAvg)、Krum聚合以及新型损失感知联邦平均(L-FedAvg)。结果表明,Krum聚合在ROUGE、BLEU、BERTScore和RaTEScore等词汇与语义评估指标上表现最优。研究证实联邦学习模型在生成临床相关且语义丰富的放射学报告方面可媲美甚至超越集中式模型。该轻量级隐私保护框架为不妥协数据机密性的协作医疗AI开发提供了新途径。
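Two of the aggregation rules named in the abstract, FedAvg and Krum, have standard definitions that fit in a few lines. The sketch below follows those general definitions on plain Python lists and is not the study's implementation; L-FedAvg is omitted because the abstract does not specify it. The function names and the toy client updates are assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def fedavg(updates: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Weighted average of client updates (weights are typically sample counts)."""
    total = sum(weights)
    dim = len(updates[0])
    return [sum(w * u[j] for u, w in zip(updates, weights)) / total
            for j in range(dim)]


def krum(updates: Sequence[Vector], f: int) -> Vector:
    """Return the single update closest to its n - f - 2 nearest neighbours.

    f is the assumed number of faulty or malicious clients.
    """
    n = len(updates)
    m = n - f - 2  # number of neighbours included in each score
    if m < 1:
        raise ValueError("Krum needs at least f + 3 updates")

    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    scores = []
    for i, u in enumerate(updates):
        dists = sorted(sq_dist(u, v) for j, v in enumerate(updates) if j != i)
        scores.append(sum(dists[:m]))
    best = min(range(n), key=scores.__getitem__)
    return list(updates[best])


# Toy client updates: four similar ones and one outlier.
client_updates = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.2], [100.0, -100.0]]
```

On this toy input, equal-weight FedAvg is pulled far toward the outlier, while Krum returns one of the four clustered updates. That robustness is one plausible reason a Krum-style rule can score well when client updates are heterogeneous, though the abstract itself does not state the cause.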


Explainability of Large Language Models using SMILE: Statistical Model-agnostic Interpretability with Local Explanations

Abstract

arXiv:2505.21657v1 Announce Type: cross Abstract: Large language models like GPT, LLAMA, and Claude have become incredibly powerful at generating text, but they are still black boxes, so it is hard to understand how they decide what to say. That lack of transparency can be problematic, especially in fields where trust and accountability matter. To help with this, we introduce SMILE, a new method that explains how these models respond to different parts of a prompt. SMILE is model-agnostic and works by slightly changing the input, measuring how the output changes, and then highlighting which words had the most impact. It then creates simple visual heat maps showing which parts of a prompt matter the most. We tested SMILE on several leading LLMs and used metrics such as accuracy, consistency, stability, and fidelity to show that it gives clear and reliable explanations. By making these models easier to understand, SMILE brings us one step closer to making AI more transparent and trustworthy.

摘要

诸如GPT、LLAMA和Claude等大型语言模型在文本生成方面已展现出强大能力,但其内部机制仍如同黑箱,难以理解其决策依据。这种透明度的缺失在需要信任与问责的领域尤为棘手。为此,我们提出SMILE——一种通过微调输入并测量输出变化来解释模型响应机制的新方法。该模型无关技术能精准定位提示文本中影响力最大的词汇,并生成直观的热力图进行可视化呈现。我们在多个前沿大语言模型上验证了SMILE的有效性,采用准确性、一致性、稳定性和保真度等指标证明其解释的清晰度与可靠性。通过提升模型可解释性,SMILE为增强人工智能的透明度和可信度迈出了关键一步。


Counterfactual Simulatability of LLM Explanations for Generation Tasks

Abstract

arXiv:2505.21740v1 Announce Type: cross Abstract: LLMs can be unpredictable, as even slight alterations to the prompt can cause the output to change in unexpected ways. Thus, the ability of models to accurately explain their behavior is critical, especially in high-stakes settings. One approach for evaluating explanations is counterfactual simulatability, how well an explanation allows users to infer the model's output on related counterfactuals. Counterfactual simulatability has been previously studied for yes/no question answering tasks. We provide a general framework for extending this method to generation tasks, using news summarization and medical suggestion as example use cases. We find that while LLM explanations do enable users to better predict LLM outputs on counterfactuals in the summarization setting, there is significant room for improvement for medical suggestion. Furthermore, our results suggest that the evaluation for counterfactual simulatability may be more appropriate for skill-based tasks as opposed to knowledge-based tasks.

摘要

大型语言模型(LLM)的行为可能难以预测,即使对提示进行微小改动也可能导致输出发生意料之外的变化。因此,模型准确解释自身行为的能力至关重要,特别是在高风险场景中。评估解释的一种方法是反事实可模拟性,即解释能使用户在多大程度上推断模型在相关反事实上的输出。此前反事实可模拟性研究主要针对是非问答任务。我们提出了一个通用框架,将该方法扩展至生成任务,并以新闻摘要和医疗建议作为应用案例。研究发现,在摘要场景中,LLM的解释确实能帮助用户更好地预测模型在反事实上的输出,但在医疗建议方面仍有显著改进空间。此外,结果表明反事实可模拟性评估可能更适用于基于技能的任务,而非基于知识的任务。


OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions

Abstract

arXiv:2505.21724v1 Announce Type: cross Abstract: In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task that aims to generate synchronized verbal and non-verbal listener feedback online, conditioned on the speaker's multimodal input. OMCRG reflects natural dyadic interactions and poses new challenges in achieving synchronization between the generated audio and facial responses of the listener. To address these challenges, we innovatively introduce text as an intermediate modality to bridge the audio and facial responses. We hence propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates high-quality multi-modal listener responses. OmniResponse leverages a pretrained LLM enhanced with two novel components: Chrono-Text, which temporally anchors generated text tokens, and TempoVoice, a controllable online TTS module that produces speech synchronized with facial reactions. To support further OMCRG research, we present ResponseNet, a new dataset comprising 696 high-quality dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and facial behavior annotations. Comprehensive evaluations conducted on ResponseNet demonstrate that OmniResponse significantly outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality.

摘要

本文提出在线多模态对话响应生成(OMCRG)这一新任务,旨在基于说话者的多模态输入,在线生成同步的言语与非言语倾听者反馈。该任务反映了自然二元互动特性,并在实现倾听者生成音频与面部反应的同步性方面提出了新挑战。为解决这些挑战,我们创新性地引入文本作为中间模态以桥接音频与面部反应,进而提出多模态大语言模型OmniResponse,该模型能自回归生成高质量的多模态倾听者响应。OmniResponse采用预训练大语言模型架构,并集成两个新组件:时序锚定生成文本标记的Chrono-Text模块,以及可控制在线生成与面部反应同步语音的TempoVoice合成模块。为推进OMCRG研究,我们构建了包含696段高质量二元互动的ResponseNet数据集,内含同步分屏视频、多通道音频、转录文本及面部行为标注。基于ResponseNet的全面评估表明,OmniResponse在语义语音内容、视听同步性和生成质量方面显著优于基线模型。


VeriTrail: Closed-Domain Hallucination Detection with Traceability

Abstract

arXiv:2505.21786v1 Announce Type: cross Abstract: Even when instructed to adhere to source material, Language Models often generate unsubstantiated content - a phenomenon known as "closed-domain hallucination." This risk is amplified in processes with multiple generative steps (MGS), compared to processes with a single generative step (SGS). However, due to the greater complexity of MGS processes, we argue that detecting hallucinations in their final outputs is necessary but not sufficient: it is equally important to trace where hallucinated content was likely introduced and how faithful content may have been derived from the source through intermediate outputs. To address this need, we present VeriTrail, the first closed-domain hallucination detection method designed to provide traceability for both MGS and SGS processes. We also introduce the first datasets to include all intermediate outputs as well as human annotations of final outputs' faithfulness for their respective MGS processes. We demonstrate that VeriTrail outperforms baseline methods on both datasets.

摘要

即使被要求严格遵循源材料,语言模型仍经常生成未经证实的内容——这种现象被称为"闭域幻觉"。与单步生成过程(SGS)相比,多步生成过程(MGS)中这种风险会被进一步放大。然而由于MGS过程更为复杂,我们认为仅检测最终输出的幻觉虽有必要但并不充分:同等重要的是追踪幻觉内容可能被引入的环节,以及忠实内容如何通过中间输出从源材料派生。为此,我们提出了VeriTrail——首个专为MGS和SGS过程提供可追溯性的闭域幻觉检测方法。同时我们发布了首个包含所有中间输出及MGS过程最终输出忠实性人工标注的数据集。实验表明,VeriTrail在两个数据集上的表现均优于基线方法。


DualSchool: How Reliable are LLMs for Optimization Education?

Abstract

arXiv:2505.21775v1 Announce Type: cross Abstract: Consider the following task taught in introductory optimization courses which addresses challenges articulated by the community at the intersection of (generative) AI and OR: generate the dual of a linear program. LLMs, being trained at web-scale, have the conversion process and many instances of Primal to Dual Conversion (P2DC) at their disposal. Students may thus reasonably expect that LLMs would perform well on the P2DC task. To assess this expectation, this paper introduces DualSchool, a comprehensive framework for generating and verifying P2DC instances. The verification procedure of DualSchool uses the Canonical Graph Edit Distance, going well beyond existing evaluation methods for optimization models, which exhibit many false positives and negatives when applied to P2DC. Experiments performed by DualSchool reveal interesting findings. Although LLMs can recite the conversion procedure accurately, state-of-the-art open LLMs fail to consistently produce correct duals. This finding holds even for the smallest two-variable instances and for derivative tasks, such as correctness, verification, and error classification. The paper also discusses the implications for educators, students, and the development of large reasoning systems.

摘要

考虑以下在优化入门课程中讲授的任务,该任务针对(生成式)人工智能与运筹学交叉领域社区提出的挑战:生成线性规划的对偶问题。大型语言模型(LLMs)通过互联网规模训练,已掌握对偶转换流程及大量原始-对偶转换(P2DC)实例。因此学生有理由预期LLMs在P2DC任务上表现良好。为验证该预期,本文提出DualSchool框架——一个用于生成和验证P2DC实例的完整体系。DualSchool的验证程序采用规范图编辑距离,其评估深度远超现有优化模型评估方法(这些方法在P2DC任务中存在大量假阳性与假阴性)。DualSchool实验揭示了有趣发现:尽管LLMs能准确复述转换流程,但最先进的开源LLMs仍无法持续生成正确对偶形式。这一现象即便在最小的双变量实例和衍生任务(如正确性验证、错误分类)中依然存在。本文还探讨了该发现对教育者、学生及大型推理系统开发的启示。


MMTBENCH: A Unified Benchmark for Complex Multimodal Table Reasoning

Abstract

arXiv:2505.21771v1 Announce Type: cross Abstract: Multimodal tables, those that integrate semi-structured data with visual elements such as charts and maps, are ubiquitous across real-world domains, yet they pose a formidable challenge to current vision-language models (VLMs). While large language models (LLMs) and VLMs have demonstrated strong capabilities in text and image understanding, their performance on complex, real-world multimodal table reasoning remains unexplored. To bridge this gap, we introduce MMTBENCH (Multimodal Table Benchmark), a benchmark consisting of 500 real-world multimodal tables drawn from diverse sources, with a total of 4,021 question-answer pairs. MMTBENCH questions cover four question types (Explicit, Implicit, Answer Mention, and Visual Based), five reasoning types (Mathematical, Extrema Identification, Fact Verification, Vision Based, and Others), and eight table types (Single/Multiple Entity, Maps and Charts with Entities, Single/Multiple Charts, Maps, and Visualizations). Extensive evaluation of state-of-the-art models on all types reveals substantial performance gaps, particularly on questions requiring visual-based reasoning and multi-step inference. These findings show the urgent need for improved architectures that more tightly integrate vision and language processing. By providing a challenging, high-quality resource that mirrors the complexity of real-world tasks, MMTBENCH underscores its value as a resource for future research on multimodal tables.

摘要

融合半结构化数据与图表、地图等视觉元素的多模态表格在现实领域无处不在,却对当前视觉语言模型(VLMs)构成严峻挑战。尽管大语言模型(LLMs)和VLMs在文本与图像理解方面展现出强大能力,但其在复杂现实场景下的多模态表格推理性能仍未得到探索。为填补这一空白,我们提出MMTBENCH(多模态表格基准),该基准包含500个源自多样现实场景的真实多模态表格,共计4021组问答对。MMTBENCH的问题涵盖四种问题类型(显式、隐式、答案提及和视觉基础)、五种推理类型(数学计算、极值识别、事实验证、视觉基础和其它)以及八种表格类型(单/多实体、含实体地图与图表、单/多图表、地图及可视化)。通过对前沿模型的全类型评估,我们发现其存在显著性能差距,尤其在需要视觉推理和多步推断的问题上。这些发现表明,亟需开发能更紧密融合视觉与语言处理的新型架构。MMTBENCH通过提供反映现实任务复杂性的高质量挑战性资源,凸显了其作为未来多模态表格研究基础资源的价值。


Scientific Paper Retrieval with LLM-Guided Semantic-Based Ranking

Abstract

arXiv:2505.21815v1 Announce Type: cross Abstract: Scientific paper retrieval is essential for supporting literature discovery and research. While dense retrieval methods demonstrate effectiveness in general-purpose tasks, they often fail to capture fine-grained scientific concepts that are essential for accurate understanding of scientific queries. Recent studies also use large language models (LLMs) for query understanding; however, these methods often lack grounding in corpus-specific knowledge and may generate unreliable or unfaithful content. To overcome these limitations, we propose SemRank, an effective and efficient paper retrieval framework that combines LLM-guided query understanding with a concept-based semantic index. Each paper is indexed using multi-granular scientific concepts, including general research topics and detailed key phrases. At query time, an LLM identifies core concepts derived from the corpus to explicitly capture the query's information need. These identified concepts enable precise semantic matching, significantly enhancing retrieval accuracy. Experiments show that SemRank consistently improves the performance of various base retrievers, surpasses strong existing LLM-based baselines, and remains highly efficient.

摘要

科学论文检索对于支持文献发现与研究至关重要。尽管密集检索方法在通用任务中表现出有效性,但它们往往无法捕捉对准确理解科学查询至关重要的细粒度科学概念。近期研究也尝试利用大语言模型(LLM)进行查询理解,但这些方法通常缺乏语料库特定知识的支撑,可能生成不可靠或不忠实的内容。为克服这些局限,我们提出SemRank,一个高效且有效的论文检索框架,该框架将LLM引导的查询理解与基于概念的语义索引相结合。每篇论文通过多粒度科学概念(包括通用研究主题和详细关键短语)进行索引。查询时,大语言模型会识别源自语料库的核心概念,以显式捕获查询的信息需求。这些识别出的概念可实现精确的语义匹配,显著提升检索准确性。实验表明,SemRank能持续提升各类基础检索器的性能,超越现有基于LLM的强基线,同时保持高效性。


Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones

Abstract

arXiv:2505.21825v1 Announce Type: cross Abstract: Inference-time computation has emerged as a promising scaling axis for improving large language model reasoning. However, despite yielding impressive performance, the optimal allocation of inference-time computation remains poorly understood. A central question is whether to prioritize sequential scaling (e.g., longer chains of thought) or parallel scaling (e.g., majority voting across multiple short chains of thought). In this work, we seek to illuminate the landscape of test-time scaling by demonstrating the existence of reasoning settings where sequential scaling offers an exponential advantage over parallel scaling. These settings are based on graph connectivity problems in challenging distributions of graphs. We validate our theoretical findings with comprehensive experiments across a range of language models, including models trained from scratch for graph connectivity with different chain of thought strategies as well as large reasoning models.

摘要

推理时计算已成为提升大语言模型推理能力的重要扩展方向。然而,尽管该技术能带来显著性能提升,学界对推理时计算的最优分配机制仍缺乏深入理解。核心问题在于应优先采用序列化扩展(如更长的思维链)还是并行化扩展(如跨多个短思维链的多数投票)。本研究通过证明在某些推理场景中序列化扩展相对并行化扩展具有指数级优势,旨在揭示测试时扩展的适用边界。这些场景基于具有挑战性的图分布中的连通性问题。我们通过涵盖多类语言模型的系统性实验验证理论发现,包括采用不同思维链策略从头训练的图连通性专用模型,以及通用大型推理模型。
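
For readers unfamiliar with the parallel-scaling baseline named in the abstract above, the following is a minimal sketch of majority voting over several short chains of thought. The function name `majority_vote` and the sample answers are hypothetical and are not taken from the paper; the sketch only shows the aggregation step, not how the chains are sampled.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer among independently sampled chains."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) yields [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers from five short chains on one connectivity query.
sampled = ["connected", "not connected", "connected", "connected", "not connected"]
print(majority_vote(sampled))
```

Sequential scaling, by contrast, would spend the same token budget on a single longer chain rather than on more votes.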


Extracting Research Instruments from Educational Literature Using LLMs

Abstract

arXiv:2505.21855v1 Announce Type: cross Abstract: Large Language Models (LLMs) are transforming information extraction from academic literature, offering new possibilities for knowledge management. This study presents an LLM-based system designed to extract detailed information about research instruments used in the education field, including their names, types, target respondents, measured constructs, and outcomes. Using multi-step prompting and a domain-specific data schema, it generates structured outputs optimized for educational research. Our evaluation shows that this system significantly outperforms other approaches, particularly in identifying instrument names and detailed information. This demonstrates the potential of LLM-powered information extraction in educational contexts, offering a systematic way to organize research instrument information. The ability to aggregate such information at scale enhances accessibility for researchers and education leaders, facilitating informed decision-making in educational research and policy.

摘要

大语言模型(LLMs)正在改变学术文献的信息提取方式,为知识管理提供了新的可能性。本研究提出了一种基于LLM的系统,旨在从教育领域提取研究工具的详细信息,包括其名称、类型、目标受访者、测量构念及结果。该系统采用多步提示和特定领域数据模式,生成针对教育研究优化的结构化输出。评估结果表明,该系统在识别工具名称及详细信息方面显著优于其他方法,这证明了LLM驱动的信息提取在教育场景中的潜力,为系统化组织研究工具信息提供了途径。大规模聚合此类信息的能力增强了研究人员和教育领导者对数据的可获取性,有助于推动教育研究与政策制定的科学决策。


Beyond Perception: Evaluating Abstract Visual Reasoning through Multi-Stage Task

Abstract

arXiv:2505.21850v1 Announce Type: cross Abstract: Current Multimodal Large Language Models (MLLMs) excel in general visual reasoning but remain underexplored in Abstract Visual Reasoning (AVR), which demands higher-order reasoning to identify abstract rules beyond simple perception. Existing AVR benchmarks focus on single-step reasoning, emphasizing the end result but neglecting the multi-stage nature of the reasoning process. Past studies found that MLLMs struggle with these benchmarks, but they do not explain how the models fail. To address this gap, we introduce MultiStAR, a Multi-Stage AVR benchmark based on RAVEN, designed to assess reasoning across varying levels of complexity. Additionally, existing metrics like accuracy focus only on the final outcomes and do not account for the correctness of intermediate steps. Therefore, we propose a novel metric, MSEval, which considers the correctness of intermediate steps in addition to the final outcomes. We conduct comprehensive experiments on MultiStAR using 17 representative closed-source and open-source MLLMs. The results reveal that while existing MLLMs perform adequately on basic perception tasks, they continue to face challenges in more complex rule detection stages.

摘要

当前的多模态大语言模型(MLLMs)在通用视觉推理任务中表现优异,但在抽象视觉推理(AVR)领域的研究仍显不足。AVR需要超越简单感知的高阶推理能力以识别抽象规则。现有AVR基准测试主要关注单步推理,强调最终结果而忽视了推理过程的多阶段性。既往研究发现MLLMs在这些基准测试中表现欠佳,但未能揭示其失败机制。为填补这一空白,我们基于RAVEN框架开发了MultiStAR——一个多阶段AVR评估基准,旨在测试模型在不同复杂度层级上的推理能力。此外,现有评估指标(如准确率)仅关注最终结果,未能考量中间步骤的正确性。为此,我们提出新型评估指标MSEval,该指标同时兼顾中间步骤与最终结果的正确性。我们使用17个具有代表性的闭源和开源MLLMs在MultiStAR上开展了全面实验。结果表明:现有MLLMs在基础感知任务中表现尚可,但在更复杂的规则检测阶段仍面临显著挑战。


Evaluating the Retrieval Robustness of Large Language Models

Abstract

arXiv:2505.21870v1 Announce Type: cross Abstract: Retrieval-augmented generation (RAG) generally enhances large language models' (LLMs) ability to solve knowledge-intensive tasks. But RAG may also lead to performance degradation due to imperfect retrieval and the model's limited ability to leverage retrieved content. In this work, we evaluate the robustness of LLMs in practical RAG setups (henceforth retrieval robustness). We focus on three research questions: (1) whether RAG is always better than non-RAG; (2) whether more retrieved documents always lead to better performance; and (3) whether document order impacts results. To facilitate this study, we establish a benchmark of 1500 open-domain questions, each with retrieved documents from Wikipedia. We introduce three robustness metrics, each corresponding to one research question. Our comprehensive experiments, involving 11 LLMs and 3 prompting strategies, reveal that all of these LLMs exhibit surprisingly high retrieval robustness; nonetheless, different degrees of imperfect robustness hinder them from fully utilizing the benefits of RAG.

摘要

检索增强生成(RAG)通常能提升大语言模型(LLM)解决知识密集型任务的能力。但由于检索不完善及模型利用检索内容的能力有限,RAG也可能导致性能下降。本研究评估了LLM在实际RAG设置中的鲁棒性(以下简称检索鲁棒性),聚焦三个研究问题:(1)RAG是否始终优于非RAG;(2)更多检索文档是否总能带来更好性能;(3)文档排序是否影响结果。为此,我们构建了包含1500个开放域问题的基准数据集,每个问题均配有从维基百科检索的文档,并提出三项鲁棒性指标,每项对应一个研究问题。通过对11种LLM和3种提示策略的全面实验,我们发现所有LLM都表现出惊人的高检索鲁棒性;然而,不同程度的不完美鲁棒性仍阻碍它们充分获取RAG的优势。


Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference

Abstract

arXiv:2505.21919v1 Announce Type: cross Abstract: The increasing adoption of large language models (LLMs) with extended context windows necessitates efficient Key-Value Cache (KVC) management to optimize inference performance. Inference workloads like Retrieval-Augmented Generation (RAG) and agents exhibit high cache reusability, making efficient caching critical to reducing redundancy and improving speed. We analyze real-world KVC access patterns using publicly available traces and evaluate commercial key-value stores like Redis and state-of-the-art RDMA-based systems (CHIME [1] and Sherman [2]) for KVC metadata management. Our work demonstrates the lack of a tailored storage solution for KVC prefilling, underscores the need for an efficient distributed caching system with optimized metadata management for LLM workloads, and provides insights into designing improved KVC management systems for scalable, low-latency inference.

摘要

随着大语言模型(LLMs)长上下文窗口的广泛应用,高效的键值缓存(KVC)管理成为优化推理性能的关键。检索增强生成(RAG)和智能体等推理工作负载表现出较高的缓存复用性,这使得高效缓存对减少冗余和提升速度至关重要。我们基于公开可用的轨迹数据分析了真实场景中的KVC访问模式,并评估了Redis等商用键值存储系统及基于RDMA的前沿系统(CHIME[1]和Sherman[2])在KVC元数据管理中的表现。本研究揭示了当前缺乏针对KVC预填充的专用存储方案,强调需要为LLM工作负载设计具备优化元数据管理的高效分布式缓存系统,同时为构建可扩展、低延迟的KVC管理系统提供了改进思路。


Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development

Abstract

arXiv:2505.21898v1 Announce Type: cross Abstract: Recent advancements in Large Language Models (LLMs) and autonomous agents have demonstrated remarkable capabilities across various domains. However, standalone agents frequently encounter limitations when handling complex tasks that demand extensive interactions and substantial computational resources. Although Multi-Agent Systems (MAS) alleviate some of these limitations through collaborative mechanisms like task decomposition, iterative communication, and role specialization, they typically remain resource-unaware, incurring significant inefficiencies due to high token consumption and excessive execution time. To address these limitations, we propose a resource-aware multi-agent system -- Co-Saving (meaning that multiple agents collaboratively engage in resource-saving activities), which leverages experiential knowledge to enhance operational efficiency and solution quality. Our key innovation is the introduction of "shortcuts" -- instructional transitions learned from historically successful trajectories -- which allow the system to bypass redundant reasoning agents and expedite the collective problem-solving process. Experiments on software development tasks demonstrate significant advantages over existing methods. Specifically, compared to the state-of-the-art MAS ChatDev, our method achieves an average reduction of 50.85% in token usage, and improves the overall code quality by 10.06%.

摘要

大语言模型(LLMs)与自主智能体的最新进展已在多个领域展现出卓越能力。然而,独立智能体在处理需要大量交互和计算资源的复杂任务时仍存在局限。尽管多智能体系统(MAS)通过任务分解、迭代通信和角色专业化等协作机制缓解了部分问题,但现有系统通常缺乏资源意识,因高令牌消耗和过长执行时间导致显著效率低下。为此,我们提出一种资源感知型多智能体系统——Co-Saving(意为多个智能体协同参与资源节约活动),该系统利用经验知识提升运行效率与解决方案质量。我们的核心创新是引入"捷径"机制——从历史成功轨迹中学习到的指令跳转——可绕过冗余推理智能体以加速集体问题解决过程。在软件开发任务的实验中,本方法展现出显著优势:相较于最先进的多智能体系统ChatDev,平均降低50.85%的令牌使用量,并将整体代码质量提升10.06%。


Abstract

arXiv:2505.21908v1 Announce Type: cross Abstract: Diagnosis-Related Group (DRG) codes are essential for hospital reimbursement and operations but require labor-intensive assignment. Large Language Models (LLMs) struggle with DRG coding due to the out-of-distribution (OOD) nature of the task: pretraining corpora rarely contain private clinical or billing data. We introduce DRG-Sapphire, which uses large-scale reinforcement learning (RL) for automated DRG coding from clinical notes. Built on Qwen2.5-7B and trained with Group Relative Policy Optimization (GRPO) using rule-based rewards, DRG-Sapphire introduces a series of RL enhancements to address domain-specific challenges not seen in previous mathematical tasks. Our model achieves state-of-the-art accuracy on the MIMIC-IV benchmark and generates physician-validated reasoning for DRG assignments, significantly enhancing explainability. Our study further sheds light on broader challenges of applying RL to knowledge-intensive, OOD tasks. We observe that RL performance scales approximately linearly with the logarithm of the number of supervised fine-tuning (SFT) examples, suggesting that RL effectiveness is fundamentally constrained by the domain knowledge encoded in the base model. For OOD tasks like DRG coding, strong RL performance requires sufficient knowledge infusion prior to RL. Consequently, scaling SFT may be more effective and computationally efficient than scaling RL alone for such tasks.

摘要

诊断相关组(DRG)编码对医院报销和运营至关重要,但其人工分配过程耗时费力。大型语言模型(LLM)由于该任务的外分布(OOD)特性——预训练语料库极少包含私有临床或计费数据——在DRG编码任务中表现欠佳。我们提出DRG-Sapphire系统,该系统通过大规模强化学习(RL)实现临床记录自动编码。基于Qwen2.5-7B架构并采用基于规则的奖励函数进行群体相对策略优化(GRPO)训练,DRG-Sapphire引入了一系列RL增强技术以解决先前数学任务中未见的领域特定挑战。我们的模型在MIMIC-IV基准测试中达到最先进准确率,并能生成经医师验证的DRG分配逻辑,显著提升可解释性。本研究进一步揭示了将RL应用于知识密集型OOD任务的广泛挑战。我们观察到RL性能与监督微调(SFT)样本数量的对数近似线性相关,表明RL效果本质上受限于基础模型编码的领域知识。对于DRG编码这类OOD任务,要实现强RL性能需在RL阶段前完成充分的知识注入。因此,对此类任务而言,扩展SFT可能比单独扩展RL更具效果和计算效率。
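
The scaling observation in the abstract above can be written compactly. This is only a hedged restatement of the stated trend: the symbols $\mathrm{Acc}_{\mathrm{RL}}$, $N_{\mathrm{SFT}}$, $\alpha$, and $\beta$ are notation introduced here for illustration, and no fitted values are implied.

```latex
% Post-RL accuracy as a function of the number of SFT examples.
% \alpha and \beta are unspecified fit constants (assumed notation).
\[
  \mathrm{Acc}_{\mathrm{RL}}(N_{\mathrm{SFT}})
  \;\approx\; \alpha + \beta \,\log N_{\mathrm{SFT}}
\]
% Consequence of the log-linear form: doubling the SFT set adds a constant.
\[
  \mathrm{Acc}_{\mathrm{RL}}(2N_{\mathrm{SFT}}) - \mathrm{Acc}_{\mathrm{RL}}(N_{\mathrm{SFT}})
  \;\approx\; \beta \,\log 2
\]
```

Under this form, each doubling of the SFT data yields roughly the same absolute gain, so the returns per added example diminish even though the returns per doubling stay constant.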


MapStory: LLM-Powered Text-Driven Map Animation Prototyping with Human-in-the-Loop Editing

Abstract

arXiv:2505.21966v1 Announce Type: cross Abstract: We introduce MapStory, an LLM-powered animation authoring tool that generates editable map animation sequences directly from natural language text. Given a user-written script, MapStory leverages an agentic architecture to automatically produce a scene breakdown, which decomposes the script into key animation building blocks such as camera movements, visual highlights, and animated elements. Our system includes a researcher component that accurately queries geospatial information by leveraging an LLM with web search, enabling the automatic extraction of relevant regions, paths, and coordinates while allowing users to edit and query for changes or additional information to refine the results. Additionally, users can fine-tune parameters of these blocks through an interactive timeline editor. We detail the system's design and architecture, informed by formative interviews with professional animators and an analysis of 200 existing map animation videos. Our evaluation, which includes expert interviews (N=5) and a usability study (N=12), demonstrates that MapStory enables users to create map animations with ease, facilitates faster iteration, encourages creative exploration, and lowers barriers to creating map-centric stories.

摘要

我们介绍MapStory——一个基于大语言模型的动画创作工具,能够直接从自然语言文本生成可编辑的地图动画序列。该系统通过智能代理架构,将用户编写的脚本自动分解为场景构成要素,包括摄像机运动、视觉高亮和动画元素等关键动画构建模块。我们的系统配备研究组件,通过结合大语言模型与网络搜索精确查询地理空间信息,可自动提取相关区域、路径和坐标,同时允许用户通过编辑和查询来调整结果或获取补充信息。用户还可通过交互式时间线编辑器微调这些模块的参数。系统设计基于对专业动画师的初步访谈及200个现有地图动画视频的分析。评估结果显示(包括5位专家访谈和12人可用性研究),MapStory能帮助用户轻松创建地图动画,加快迭代速度,激发创意探索,并降低制作地图叙事作品的门槛。


Cross-modal RAG: Sub-dimensional Retrieval-Augmented Text-to-Image Generation

Abstract

arXiv:2505.21956v1 Announce Type: cross Abstract: Text-to-image generation increasingly demands access to domain-specific, fine-grained, and rapidly evolving knowledge that pretrained models cannot fully capture. Existing Retrieval-Augmented Generation (RAG) methods attempt to address this by retrieving globally relevant images, but they fail when no single image contains all desired elements from a complex user query. We propose Cross-modal RAG, a novel framework that decomposes both queries and images into sub-dimensional components, enabling subquery-aware retrieval and generation. Our method introduces a hybrid retrieval strategy - combining a sub-dimensional sparse retriever with a dense retriever - to identify a Pareto-optimal set of images, each contributing complementary aspects of the query. During generation, a multimodal large language model is guided to selectively condition on relevant visual features aligned to specific subqueries, ensuring subquery-aware image synthesis. Extensive experiments on MS-COCO, Flickr30K, WikiArt, CUB, and ImageNet-LT demonstrate that Cross-modal RAG significantly outperforms existing baselines in both retrieval and generation quality, while maintaining high efficiency.

摘要

文本到图像生成日益需要获取预训练模型无法完全掌握的领域特定、细粒度且快速更新的知识。现有的检索增强生成(RAG)方法试图通过检索全局相关图像来解决这一问题,但当复杂用户查询中的所需元素无法在单张图像中完整呈现时,这些方法便会失效。我们提出跨模态RAG框架,该框架将查询和图像分解为子维度组件,实现子查询感知的检索与生成。我们的方法引入了一种混合检索策略——结合子维度稀疏检索器与稠密检索器——以识别帕累托最优图像集合,其中每张图像贡献查询的互补方面。在生成过程中,通过引导多模态大语言模型选择性地以特定子查询对齐的相关视觉特征为条件,确保子查询感知的图像合成。在MS-COCO、Flickr30K、WikiArt、CUB和ImageNet-LT上的大量实验表明,跨模态RAG在检索和生成质量上均显著优于现有基线方法,同时保持高效性。
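
To make the notion of a Pareto-optimal image set concrete, here is a minimal sketch that assumes each candidate image already has one relevance score per subquery. The score values, image ids, and function names are hypothetical; the paper's actual sparse and dense retrievers are not reproduced here.

```python
def dominates(a, b):
    """True if vector a is at least as good as b on every subquery and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores):
    """Return ids whose score vectors are not dominated by any other candidate."""
    return [
        i for i, s in scores.items()
        if not any(dominates(t, s) for j, t in scores.items() if j != i)
    ]


# Hypothetical per-subquery relevance scores for four candidate images.
scores = {
    "img_a": (0.9, 0.1, 0.2),
    "img_b": (0.2, 0.8, 0.3),
    "img_c": (0.1, 0.1, 0.1),
    "img_d": (0.3, 0.3, 0.9),
}
print(pareto_front(scores))
```

In this toy input, `img_c` is dropped because every other image scores at least as well on all three subqueries, while the remaining images each cover a different subquery best.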


Learning Compositional Behaviors from Demonstration and Language

Abstract

arXiv:2505.21981v1 Announce Type: cross Abstract: We introduce Behavior from Language and Demonstration (BLADE), a framework for long-horizon robotic manipulation by integrating imitation learning and model-based planning. BLADE leverages language-annotated demonstrations, extracts abstract action knowledge from large language models (LLMs), and constructs a library of structured, high-level action representations. These representations include preconditions and effects grounded in visual perception for each high-level action, along with corresponding controllers implemented as neural network-based policies. BLADE can recover such structured representations automatically, without manually labeled states or symbolic definitions. BLADE shows significant capabilities in generalizing to novel situations, including novel initial states, external state perturbations, and novel goals. We validate the effectiveness of our approach both in simulation and on real robots with a diverse set of objects with articulated parts, partial observability, and geometric constraints.

摘要

我们提出"基于语言与演示的行为框架"(BLADE),一种通过整合模仿学习与模型规划实现长周期机器人操作的框架。BLADE利用语言标注的演示数据,从大语言模型(LLMs)中提取抽象动作知识,并构建结构化高层动作表示库。这些表示包含每个高层动作以视觉感知为基础的前提条件与效果,以及通过神经网络策略实现的对应控制器。BLADE能自动恢复此类结构化表示,无需人工标注状态或符号定义。该框架在应对新情境方面展现出显著能力,包括新初始状态、外部状态扰动及新目标等情况。我们通过在仿真环境和真实机器人上的实验验证了方法的有效性,测试场景涉及具有关节部件、部分可观测性及几何约束的多样化物体。


Analysis and Evaluation of Synthetic Data Generation in Speech Dysfluency Detection

Abstract

arXiv:2505.22029v1 Announce Type: cross Abstract: Speech dysfluency detection is crucial for clinical diagnosis and language assessment, but existing methods are limited by the scarcity of high-quality annotated data. Although recent advances in TTS models have enabled synthetic dysfluency generation, existing synthetic datasets suffer from unnatural prosody and limited contextual diversity. To address these limitations, we propose LLM-Dys -- the most comprehensive dysfluent speech corpus with LLM-enhanced dysfluency simulation. This dataset captures 11 dysfluency categories spanning both word and phoneme levels. Building upon this resource, we improve an end-to-end dysfluency detection framework. Experimental validation demonstrates state-of-the-art performance. All data, models, and code are open-sourced at https://github.com/Berkeley-Speech-Group/LLM-Dys.

摘要

言语不流畅检测对于临床诊断和语言评估至关重要,但现有方法受限于高质量标注数据的稀缺性。尽管近期文本转语音(TTS)模型的进展使得合成不流畅语音成为可能,但现有合成数据集存在韵律不自然和语境多样性不足的问题。为解决这些局限,我们提出LLM-Dys——迄今最全面的言语不流畅语料库,采用大语言模型增强的不流畅仿真。该数据集涵盖词级和音素级共11类不流畅现象。基于此资源,我们改进了一种端到端不流畅检测框架,实验验证表明其性能达到当前最优水平。所有数据、模型及代码均已开源,详见https://github.com/Berkeley-Speech-Group/LLM-Dys。


Towards Comprehensive Scene Understanding: Integrating First and Third-Person Views for LVLMs

Abstract

arXiv:2505.21955v1 Announce Type: cross Abstract: Large vision-language models (LVLMs) are increasingly deployed in interactive applications such as virtual and augmented reality, where the first-person (egocentric) view captured by head-mounted cameras serves as key input. While this view offers fine-grained cues about user attention and hand-object interactions, its narrow field of view and lack of global context often lead to failures on spatially or contextually demanding queries. To address this, we introduce a framework that augments egocentric inputs with third-person (exocentric) views, providing complementary information such as global scene layout and object visibility to LVLMs. We present E3VQA, the first benchmark for multi-view question answering with 4K high-quality question-answer pairs grounded in synchronized ego-exo image pairs. Additionally, we propose M3CoT, a training-free prompting technique that constructs a unified scene representation by integrating scene graphs from three complementary perspectives. M3CoT enables LVLMs to reason more effectively across views, yielding consistent performance gains (4.84% for GPT-4o and 5.94% for Gemini 2.0 Flash) over a recent CoT baseline. Our extensive evaluation reveals key strengths and limitations of LVLMs in multi-view reasoning and highlights the value of leveraging both egocentric and exocentric inputs.

摘要

大型视觉语言模型(LVLMs)正日益应用于虚拟现实和增强现实等交互式场景中,其中头戴式摄像头捕获的第一人称(自我中心)视角作为关键输入。尽管该视角能提供用户注意力及手物交互的细粒度线索,但其狭窄视野和全局语境缺失常导致空间或上下文复杂查询的失败。为此,我们提出一个框架,通过第三人称(他者中心)视角增强自我中心输入,为LVLMs提供全局场景布局、物体可见性等互补信息。我们推出首个多视角问答基准E3VQA,包含基于同步自我-他者图像对的4K高质量问答对。此外,我们提出M3CoT——一种无需训练的提示技术,通过整合三个互补视角的场景图构建统一场景表征。M3CoT使LVLMs能更有效地跨视角推理,相较近期CoT基线实现持续性能提升(GPT-4o提升4.84%,Gemini 2.0 Flash提升5.94%)。大量实验揭示了LVLMs在多视角推理中的核心优势与局限,并验证了融合自我中心与他者中心输入的价值。


LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents

Abstract

arXiv:2505.21963v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks. To further tailor LLMs to specific domains or applications, post-training techniques such as Supervised Fine-Tuning (SFT), Preference Learning, and model merging are commonly employed. While each of these methods has been extensively studied in isolation, the automated construction of complete post-training pipelines remains an underexplored area. Existing approaches typically rely on manual design or focus narrowly on optimizing individual components, such as data ordering or merging strategies. In this work, we introduce LaMDAgent (short for Language Model Developing Agent), a novel framework that autonomously constructs and optimizes full post-training pipelines through the use of LLM-based agents. LaMDAgent systematically explores diverse model generation techniques, datasets, and hyperparameter configurations, leveraging task-based feedback to discover high-performing pipelines with minimal human intervention. Our experiments show that LaMDAgent improves tool-use accuracy by 9.0 points while preserving instruction-following capabilities. Moreover, it uncovers effective post-training strategies that are often overlooked by conventional human-driven exploration. We further analyze the impact of data and model size scaling to reduce computational costs on the exploration, finding that scaling model size introduces new challenges, whereas scaling data size enables cost-effective pipeline discovery.

摘要

大型语言模型(LLM)已在广泛任务中展现出卓越性能。为使其更适配特定领域或应用,通常采用监督微调(SFT)、偏好学习及模型融合等训练后优化技术。尽管这些方法各自已得到深入研究,但自动化构建完整训练后流程的领域仍待探索。现有方案多依赖人工设计或仅聚焦于优化单一组件(如数据排序或融合策略)。本研究提出LaMDAgent(语言模型开发智能体框架),该创新框架通过基于LLM的智能体自主构建并优化完整训练后流程。LaMDAgent系统性地探索多样化模型生成技术、数据集及超参数配置,利用任务反馈机制以最少人工干预发现高性能流程。实验表明,LaMDAgent在保持指令跟随能力的同时将工具使用准确率提升9.0个百分点,并能发现传统人工探索常忽略的有效训练后策略。我们进一步分析了数据与模型规模缩放对降低探索计算成本的影响,发现模型规模缩放会引入新挑战,而数据规模扩展可实现高性价比的流程发现。


Judging LLMs on a Simplex

Abstract

arXiv:2505.21972v1 Announce Type: cross Abstract: Automated evaluation of free-form outputs from large language models (LLMs) is challenging because many distinct answers can be equally valid. A common practice is to use LLMs themselves as judges, but the theoretical properties of this approach are not yet well understood. We show that a geometric framework that represents both judges and candidates as points on a probability simplex can provide helpful insight on what is or is not identifiable using LLM judges. Our theoretical analysis uncovers a "phase transition" in ranking identifiability: for binary scoring systems, true rankings are identifiable even with weak judges under mild assumptions, while rankings become non-identifiable for three or more scoring levels even with infinite data, absent additional prior knowledge. This non-identifiability highlights how uncertainty in rankings stems from not only aleatoric uncertainty (i.e., inherent stochasticity in the data) but also epistemic uncertainty regarding which assumptions hold, an aspect that has received limited attention until now. To integrate both types of uncertainty, we use Bayesian inference to encode assumptions as priors and conduct sensitivity analysis of ranking estimates and credible intervals. Empirical evaluations across multiple benchmarks demonstrate that Bayesian inference yields more accurate rankings and substantially improves coverage rates. These results underscore the importance of taking a more holistic approach to uncertainty quantification when using LLMs as judges.

摘要

大型语言模型(LLM)生成自由形式输出的自动化评估具有挑战性,因为许多不同的答案可能同样有效。当前常见做法是将LLM自身作为评判者,但这种方法的理论特性尚未得到充分理解。我们提出一种几何框架,将评判者和候选答案表示为概率单纯形上的点,该框架能够揭示使用LLM评判者时哪些内容可识别或不可识别。理论分析发现排序可识别性存在"相变"现象:在二元评分体系下,即使使用弱评判者,真实排序在温和假设下仍可识别;而对于三个及以上评分等级,在缺乏额外先验知识的情况下,即使有无限数据,排序也变得不可识别。这种不可识别性表明排序的不确定性不仅源于偶然不确定性(即数据固有的随机性),还来自关于哪些假设成立的认知不确定性——这一方面此前未受足够重视。为整合两类不确定性,我们采用贝叶斯推断将假设编码为先验分布,并对排序估计与可信区间进行敏感性分析。跨多个基准的实证评估表明,贝叶斯推断能产生更准确的排序并显著提升覆盖率。这些结果强调在使用LLM作为评判者时,需要采用更全面的不确定性量化方法。


Abstract

arXiv:2505.22003v1 Announce Type: cross Abstract: Pursuit of accessible legal assistance in India faces a critical gap, as many citizens struggle to leverage their legal rights due to limited awareness and access to relevant legal information. This paper introduces Legal Assist AI, a transformer-based model designed to bridge this gap by offering effective legal assistance through large language models (LLMs). The system retrieves relevant legal information from a curated database and generates accurate responses, enabling effective assistance for diverse users, including legal professionals, scholars, and the general public. The model was fine-tuned on extensive datasets from the Indian legal domain, including the Indian Constitution, Bharatiya Nyaya Sanhita (BNS), Bharatiya Nagarik Suraksha Sanhita (BNSS) and so forth, providing a robust understanding of the complexities of Indian law. By incorporating domain-specific legal datasets, the proposed model demonstrated remarkable efficiency and specialization in legal Question-Answering. The model was evaluated against state-of-the-art models such as GPT-3.5 Turbo and Mistral 7B, achieving a 60.08% score on the AIBE, outperforming its competitors in legal reasoning and accuracy. Unlike other models, Legal Assist AI avoided common issues such as hallucinations, making it highly reliable for practical legal applications. This showcases the model's applicability in real-world legal scenarios, with future iterations aiming to enhance performance and expand its dataset to cover a broader range of multilingual and case-specific queries as well.

摘要

印度在寻求可及法律援助方面存在关键缺口,由于法律意识薄弱且难以获取相关法律信息,许多公民无法有效行使法定权利。本文提出Legal Assist AI——一种基于Transformer架构的模型,旨在通过大语言模型(LLM)提供高效法律支持以弥合这一缺口。该系统从精选数据库中检索相关法律信息并生成精准响应,能为法律从业者、学者及普通公众等不同用户群体提供有效帮助。该模型在印度宪法、《印度刑法典》(BNS)、《印度刑事诉讼法典》(BNSS)等本土法律领域的大规模数据集上进行了微调,从而对印度法律体系的复杂性具有深刻理解。通过整合专业法律数据集,该模型在法律问答任务中展现出卓越的效能与专业性。在与GPT-3.5 Turbo和Mistral 7B等前沿模型的对比评估中,其以60.08%的AIBE分数在法律推理与准确性方面超越竞争对手。不同于其他模型,Legal Assist AI有效规避了幻觉等常见问题,使其在实际法律应用中具有高度可靠性。研究证明了该模型在真实法律场景中的适用性,未来版本将通过提升性能与扩展多语言及个案查询数据集来增强服务能力。


Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

Abstract

arXiv:2505.22038v1 Announce Type: cross Abstract: Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the use of dynamic high-resolution inputs further increases this burden. Previous approaches have attempted to reduce the number of image tokens through token pruning, typically by selecting tokens based on attention scores or image token diversity. Through empirical studies, we observe that existing methods often overlook the joint impact of pruning on both the current layer's output (local) and the outputs of subsequent layers (global), leading to suboptimal pruning decisions. To address this challenge, we propose Balanced Token Pruning (BTP), a plug-and-play method for pruning vision tokens. Specifically, our method utilizes a small calibration set to divide the pruning process into multiple stages. In the early stages, our method emphasizes the impact of pruning on subsequent layers, whereas in the deeper stages, the focus shifts toward preserving the consistency of local outputs. Extensive experiments across various LVLMs demonstrate the broad effectiveness of our approach on multiple benchmarks. Our method achieves a 78% compression rate while preserving 96.7% of the original models' performance on average.

摘要

大型视觉语言模型(LVLMs)通过将图像编码为数千个标记,在多模态任务中展现出卓越性能。然而,大量图像标记导致显著的计算开销,而动态高分辨率输入的使用进一步加剧了这一负担。现有方法通常基于注意力分数或图像标记多样性进行选择,试图通过标记剪枝来减少图像标记数量。通过实证研究,我们发现现有方法往往忽视剪枝对当前层输出(局部)和后续层输出(全局)的联合影响,从而导致次优的剪枝决策。为解决这一问题,我们提出平衡标记剪枝(BTP),一种即插即用的视觉标记剪枝方法。具体而言,我们的方法利用小型校准集将剪枝过程划分为多个阶段:在早期阶段侧重剪枝对后续层的影响,而在深层阶段则转向保持局部输出的一致性。跨多种LVLMs的广泛实验表明,该方法在多个基准测试中具有普适有效性。我们的方法在实现78%压缩率的同时,平均保留了原始模型96.7%的性能。
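
As a point of reference, the score-based pruning that the abstract attributes to previous approaches can be sketched as below. This is not BTP itself: it illustrates only the single-criterion baseline, and it assumes that "compression rate" means the fraction of image tokens removed. The attention scores and the function name `prune_by_score` are hypothetical.

```python
import math


def prune_by_score(scores, compression_rate):
    """Return sorted indices of the tokens kept after dropping the lowest-scoring ones."""
    if not 0.0 <= compression_rate < 1.0:
        raise ValueError("compression_rate must be in [0, 1)")
    # Number of tokens to keep, rounded up so that at least one token survives.
    keep = max(1, math.ceil(len(scores) * (1.0 - compression_rate)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Restore positional order so the kept tokens stay in sequence.
    return sorted(ranked[:keep])


# Hypothetical attention scores for ten image tokens.
attn = [0.02, 0.30, 0.05, 0.01, 0.25, 0.07, 0.10, 0.03, 0.15, 0.02]
kept = prune_by_score(attn, 0.78)
print(kept)
```

A purely local criterion like this one ignores how removing a token changes the outputs of later layers, which is the gap the abstract says BTP targets.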


From Failures to Fixes: LLM-Driven Scenario Repair for Self-Evolving Autonomous Driving

Abstract

arXiv:2505.22067v1 Announce Type: cross Abstract: Ensuring robust and generalizable autonomous driving requires not only broad scenario coverage but also efficient repair of failure cases, particularly those related to challenging and safety-critical scenarios. However, existing scenario generation and selection methods often lack adaptivity and semantic relevance, limiting their impact on performance improvement. In this paper, we propose SERA, an LLM-powered framework that enables autonomous driving systems to self-evolve by repairing failure cases through targeted scenario recommendation. By analyzing performance logs, SERA identifies failure patterns and dynamically retrieves semantically aligned scenarios from a structured bank. An LLM-based reflection mechanism further refines these recommendations to maximize relevance and diversity. The selected scenarios are used for few-shot fine-tuning, enabling targeted adaptation with minimal data. Experiments on the benchmark show that SERA consistently improves key metrics across multiple autonomous driving baselines, demonstrating its effectiveness and generalizability under safety-critical conditions.

摘要

确保自动驾驶系统的鲁棒性和泛化性不仅需要广泛的场景覆盖,还需有效修复故障案例——尤其是涉及安全关键场景的挑战性案例。然而,现有场景生成与选择方法往往缺乏自适应性和语义关联性,限制了其对性能提升的作用。本文提出SERA框架,该框架基于大语言模型(LLM),通过定向场景推荐使自动驾驶系统具备自我进化能力。SERA通过分析性能日志识别故障模式,并从结构化场景库中动态检索语义对齐的场景。基于LLM的反思机制进一步优化推荐结果,以最大化相关性与多样性。所选场景用于少样本微调,实现最小数据量下的定向适配。基准测试表明,SERA在多个自动驾驶基线模型上持续提升关键指标,验证了其在安全关键条件下的有效性与泛化能力。


Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced (R$^2$) GRPO

Abstract

arXiv:2505.22068v1 Announce Type: cross Abstract: Previous studies suggest that, for powerful Large Language Models (LLMs) on math tasks, training with Reinforcement Learning with Verifiable Rewards (RLVR) only refines the reasoning path without improving reasoning capacity, whereas supervised fine-tuning (SFT) with distillation can improve it. We study this from the view of scientific information extraction (SciIE), where LLMs and reasoning LLMs underperform small BERT-based models. SciIE requires both reasoning and memorization. We argue that, on SciIE, both SFT and RLVR can refine the reasoning path and improve reasoning capacity in a simple way. We propose two-stage training with (1) MimicSFT, which uses structured reasoning templates without needing high-quality chain-of-thought data, and (2) R$^2$GRPO, with relevance and rule-induced rewards. Experiments on scientific IE benchmarks show that both methods can improve the reasoning capacity. R$^2$GRPO with MimicSFT surpasses baseline LLMs and specialized supervised models in relation extraction. Our code is available at https://github.com/ranlislz/R2GRPO.

摘要

先前研究表明,采用可验证奖励强化学习(RLVR)训练的大型语言模型(LLMs)在数学任务中仅能优化推理路径而无法提升推理能力,而监督微调(SFT)与蒸馏方法则可实现该目标。本研究从科学信息抽取(SciIE)视角展开探讨,发现LLMs和推理型LLMs在SciIE任务中的表现均逊色于基于BERT的小型模型。SciIE任务同时需要推理能力和记忆能力。我们论证了基于SciIE任务,SFT和RLVR均可通过简单方式优化推理路径并提升推理能力。为此提出两阶段训练框架:1. MimicSFT阶段采用结构化推理模板,无需高质量思维链数据;2. R$^2$GRPO阶段引入相关性与规则诱导的奖励机制。在科学信息抽取基准测试中,两种方法均展现出推理能力提升效果。结合MimicSFT的R$^2$GRPO方法在关系抽取任务中超越了基线LLMs和专用监督模型。代码已开源:https://github.com/ranlislz/R2GRPO。


Estimating the Effects of Sample Training Orders for Large Language Models without Retraining

Abstract

arXiv:2505.22042v1 Announce Type: cross Abstract: The order of training samples plays a crucial role in large language models (LLMs), significantly impacting both their external performance and internal learning dynamics. Traditional methods for investigating this effect generally require retraining the model with various sample orders, which is computationally infeasible for LLMs. In this work, we improve traditional methods by designing a retraining-free framework. By approximating Adam optimizer updates with first- and second-order Taylor expansions and utilizing random projection methods to store intermediate checkpoints, our framework can efficiently estimate model parameters for arbitrary training sample orders. Next, we apply our framework to two downstream research problems: (1) Training curriculum design for LLMs -- we build on our retraining-free framework to propose a novel curriculum learning strategy that augments curriculum proposals with estimated model performances, enabling more informed sample scheduling. (2) LLMs' memorization and generalization effect analysis -- we use our retraining-free framework to estimate how the positions of training samples influence LLMs' capacity for memorization and generalization. We conduct extensive experiments to validate the effectiveness of our retraining-free framework in reproducing the true model performances, and further demonstrate its potential in optimizing LLM training curricula and analyzing the memorization and generalization effects of LLMs.

摘要

训练样本的顺序在大语言模型(LLMs)中起着至关重要的作用,显著影响其外部表现和内部学习动态。传统研究这一效应的方法通常需要以不同样本顺序重新训练模型,这对LLMs而言在计算上是不可行的。在本工作中,我们通过设计一种免重训练框架改进了传统方法。通过用一阶和二阶泰勒展开近似Adam优化器更新,并利用随机投影方法存储中间检查点,我们的框架能够高效估计任意训练样本顺序下的模型参数。接着,我们将该框架应用于两个下游研究问题:(1)LLMs的训练课程设计——基于我们的免重训练框架,提出了一种新颖的课程学习策略,通过估计模型性能增强课程提案,从而实现更明智的样本调度。(2)LLMs的记忆与泛化效应分析——利用免重训练框架估计训练样本位置如何影响LLMs的记忆与泛化能力。我们进行了大量实验验证免重训练框架在复现真实模型性能方面的有效性,并进一步展示了其在优化LLM训练课程以及分析LLMs记忆与泛化效应方面的潜力。
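
For reference, the generic second-order Taylor expansion that such approximations build on is shown below. The abstract does not spell out how the expansion is applied to the Adam update, so $g$, $\theta$, and $\Delta$ are placeholder symbols introduced here, not the paper's notation.

```latex
% Generic second-order Taylor expansion of a smooth scalar function g
% around parameters \theta, for a small perturbation \Delta.
\[
  g(\theta + \Delta)
  \;\approx\;
  g(\theta)
  + \nabla g(\theta)^{\top} \Delta
  + \tfrac{1}{2}\, \Delta^{\top} \nabla^{2} g(\theta)\, \Delta
\]
% Dropping the last term gives the first-order approximation.
```

The approximation is accurate when $\Delta$ is small, which is why such estimates are usually anchored at stored intermediate checkpoints rather than at the initial parameters alone.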


iDSE: Navigating Design Space Exploration in High-Level Synthesis Using LLMs

Abstract

arXiv:2505.22086v1 Announce Type: cross Abstract: High-Level Synthesis (HLS) serves as an agile hardware development tool that streamlines the circuit design by abstracting the register transfer level into behavioral descriptions, while allowing designers to customize the generated microarchitectures through optimization directives. However, the combinatorial explosion of possible directive configurations yields an intractable design space. Traditional design space exploration (DSE) methods, despite adopting heuristics or constructing predictive models to accelerate Pareto-optimal design acquisition, still suffer from prohibitive exploration costs and suboptimal results. Addressing these concerns, we introduce iDSE, the first LLM-aided DSE framework that leverages HLS design quality perception to effectively navigate the design space. iDSE intelligently prunes the design space to guide LLMs in calibrating representative initial sampling designs, expediting convergence toward the Pareto front. By exploiting the convergent and divergent thinking patterns inherent in LLMs for hardware optimization, iDSE achieves multi-path refinement of the design quality and diversity. Extensive experiments demonstrate that iDSE outperforms heuristic-based DSE methods by 5.1× to 16.6× in proximity to the reference Pareto front, matching NSGA-II with only 4.6% of the explored designs. Our work demonstrates the transformative potential of LLMs in scalable and efficient HLS design optimization, offering new insights into multiobjective optimization challenges.

摘要

高层次综合(HLS)作为一种敏捷的硬件开发工具,通过将寄存器传输级抽象为行为描述来简化电路设计,同时允许设计者通过优化指令自定义生成的微架构。然而,可能的指令配置组合爆炸导致设计空间难以处理。传统的设计空间探索(DSE)方法尽管采用启发式算法或构建预测模型以加速获取帕累托最优设计,但仍面临探索成本过高和结果次优的问题。针对这些问题,我们提出了iDSE——首个基于大型语言模型(LLM)辅助的DSE框架,该框架利用HLS设计质量感知有效导航设计空间。iDSE智能地剪枝设计空间,引导LLM校准具有代表性的初始采样设计,加速向帕累托前沿收敛。通过利用LLM固有的收敛性和发散性思维模式进行硬件优化,iDSE实现了设计质量与多样性的多路径优化。大量实验表明,iDSE在接近参考帕累托前沿方面优于基于启发式的DSE方法5.1×∼16.6×,仅需探索4.6%的设计即可匹配NSGA-II的性能。我们的工作展示了LLM在可扩展且高效的HLS设计优化中的变革潜力,为多目标优化挑战提供了新见解。
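The abstract above measures results by proximity to a reference Pareto front. As a minimal sketch of what a Pareto front is (not iDSE's exploration procedure), the following filters a set of candidate designs down to the non-dominated ones. The two objectives (latency and area, lower is better) and the sample values are assumptions made for illustration.

```python
def dominates(a, b):
    """True if design a is no worse than b in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(designs):
    """Return the designs that no other design dominates (all objectives minimized)."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]


# Hypothetical (latency_cycles, area_luts) pairs for five directive configurations.
designs = [(100, 900), (120, 700), (150, 650), (130, 800), (160, 700)]
front = pareto_front(designs)
```

Here `(130, 800)` and `(160, 700)` are dropped because `(120, 700)` is at least as good on both objectives. The quadratic scan is fine for small candidate sets; large design spaces would need a sorted sweep.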


Knowledge Base Construction for Knowledge-Augmented Text-to-SQL

Abstract

arXiv:2505.22096v1 Announce Type: cross Abstract: Text-to-SQL aims to translate natural language queries into SQL statements, which is practical as it enables anyone to easily retrieve the desired information from databases. Recently, many existing approaches tackle this problem with Large Language Models (LLMs), leveraging their strong capability in understanding user queries and generating corresponding SQL code. Yet, the parametric knowledge in LLMs may be insufficient to cover all the diverse and domain-specific queries that require grounding in various database schemas, which often makes the generated SQL less accurate. To tackle this, we propose constructing a knowledge base for text-to-SQL, a foundational source of knowledge from which we retrieve and generate the necessary knowledge for given queries. In particular, unlike existing approaches that either manually annotate knowledge or generate only a few pieces of knowledge for each query, our knowledge base is comprehensive: it is constructed from a combination of all the available questions and their associated database schemas along with their relevant knowledge, and it can be reused for unseen databases from different datasets and domains. We validate our approach on multiple text-to-SQL datasets, considering both the overlapping and non-overlapping database scenarios, where it outperforms relevant baselines substantially.

摘要

文本到SQL的目标是将自然语言查询转换为SQL语句,这一技术具有实际意义,因为它使得任何人都能轻松从数据库中检索所需信息。近年来,许多现有方法利用大型语言模型(LLMs)的强大能力来理解用户查询并生成相应的SQL代码,从而解决这一问题。然而,LLMs中的参数化知识可能不足以覆盖所有多样化且领域特定的查询,这些查询需要基于各种数据库模式进行基础验证,因此生成的SQL语句往往不够准确。为解决这一问题,我们提出为文本到SQL构建知识库,作为基础知识来源,从中检索并生成给定查询所需的知识。具体而言,与现有方法(要么手动标注知识,要么仅为每个查询生成少量知识片段)不同,我们的知识库是全面构建的,基于所有可用问题及其相关数据库模式与对应知识的组合,并可重用于不同数据集和领域的未见数据库。我们在多个文本到SQL数据集上验证了我们的方法,考虑了数据库重叠和非重叠的场景,结果表明该方法显著优于相关基线。


Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model

Abstract

arXiv:2505.22116v1 Announce Type: cross Abstract: Intraoperative hypotension (IOH) frequently occurs under general anesthesia and is strongly linked to adverse outcomes such as myocardial injury and increased mortality. Despite its significance, IOH prediction is hindered by event sparsity and the challenge of integrating static and dynamic data across diverse patients. In this paper, we propose IOHFuseLM, a multimodal language model framework. To accurately identify and differentiate sparse hypotensive events, we leverage a two-stage training strategy. The first stage involves domain adaptive pretraining on IOH physiological time series augmented through diffusion methods, thereby enhancing the model sensitivity to patterns associated with hypotension. Subsequently, task fine-tuning is performed on the original clinical dataset to further enhance the ability to distinguish normotensive from hypotensive states. To enable multimodal fusion for each patient, we align structured clinical descriptions with the corresponding physiological time series at the token level. Such alignment enables the model to capture individualized temporal patterns alongside their corresponding clinical semantics. In addition, we convert static patient attributes into structured text to enrich personalized information. Experimental evaluations on two intraoperative datasets demonstrate that IOHFuseLM outperforms established baselines in accurately identifying IOH events, highlighting its applicability in clinical decision support scenarios. Our code is publicly available to promote reproducibility at https://github.com/zjt-gpu/IOHFuseLM.

摘要

术中低血压(IOH)在全身麻醉期间频繁发生,并与心肌损伤和死亡率升高等不良后果密切相关。尽管其重要性显著,但IOH预测受到事件稀疏性以及整合不同患者静态与动态数据挑战的阻碍。本文提出IOHFuseLM,一种多模态语言模型框架。为准确识别和区分稀疏的低血压事件,我们采用两阶段训练策略:第一阶段通过扩散方法增强的IOH生理时间序列进行领域自适应预训练,从而提升模型对低血压相关模式的敏感性;随后在原始临床数据集上进行任务微调,以进一步增强区分正常血压与低血压状态的能力。为实现每位患者的多模态融合,我们在标记级别将结构化临床描述与对应生理时间序列对齐,使模型能同时捕捉个体化时序模式及其对应临床语义。此外,我们将静态患者属性转化为结构化文本以丰富个性化信息。在两个术中数据集上的实验评估表明,IOHFuseLM在准确识别IOH事件方面优于现有基线方法,凸显了其在临床决策支持场景中的适用性。我们的代码已公开以促进可复现性:https://github.com/zjt-gpu/IOHFuseLM。


MRT at SemEval-2025 Task 8: Maximizing Recovery from Tables with Multiple Steps

Abstract

arXiv:2505.22264v1 Announce Type: cross Abstract: In this paper we present our approach to the SemEval 2025 Task 8: Question-Answering over Tabular Data challenge. Our strategy leverages Python code generation with LLMs to interact with the table and get the answer to the questions. The process is composed of multiple steps: understanding the content of the table, generating natural language instructions in the form of steps to follow in order to get the answer, translating these instructions to code, running it and handling potential errors or exceptions. These steps use open source LLMs and fine-grained, optimized prompts for each task (step). With this approach, we achieved a score of 70.50% for subtask 1.

摘要

本文阐述了我们在解决"SemEval 2025任务8:基于表格数据的问答"挑战中所采用的方法。我们的策略利用大语言模型生成Python代码与表格交互,从而获取问题答案。该方法包含多个步骤:理解表格内容、生成自然语言形式的解题步骤指令、将这些指令转化为可执行代码、运行代码并处理可能出现的错误或异常。这些步骤均采用开源大语言模型,并为每项子任务(步骤)设计了精细优化的提示模板。通过该方法,我们在子任务1中取得了70.50%的得分。


Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design

Abstract

arXiv:2505.22179v1 Announce Type: cross Abstract: Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78× speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31×. Code available at https://github.com/AI9Stars/SpecMQuant.

摘要

推测解码与量化技术能有效加速大语言模型的内存受限推理过程。推测解码通过单次前向传播验证多个令牌来缓解内存带宽瓶颈,但会增加计算负担;量化则通过将权重和激活值压缩至低位宽实现优化,并利用低位矩阵乘法减少计算量。为协同发挥两者优势,本研究探索了这两种技术的整合应用。令人意外的是,将先进的EAGLE-2推测解码方法应用于各类量化模型时发现:4比特权重量化带来的内存优势会被推测解码的计算负载所抵消。具体而言,在4比特权重量化模型上验证树状草案所需的时间开销显著高于单令牌前向传播。这一发现促使我们提出新型推测解码框架:采用小模型作为中间阶段将树状草案转为序列草案的分层架构,充分利用目标量化模型的内存访问优势。实验结果表明,在A100 GPU上对4比特权重的Llama-3-70B模型,我们的分层方法在不同任务中实现了2.78倍加速,较EAGLE-2提升1.31倍。代码详见https://github.com/AI9Stars/SpecMQuant。


Voice CMS: updating the knowledge base of a digital assistant through conversation

Abstract

arXiv:2505.22303v1 Announce Type: cross Abstract: In this study, we propose a solution based on a multi-agent LLM architecture and a voice user interface (VUI) designed to update the knowledge base of a digital assistant. Its usability is evaluated in comparison to a more traditional graphical content management system (CMS), with a focus on understanding the relationship between user preferences and the complexity of the information being provided. The findings demonstrate that, while the overall usability of the VUI is rated lower than the graphical interface, it is already preferred by users for less complex tasks. Furthermore, the quality of content entered through the VUI is comparable to that achieved with the graphical interface, even for highly complex tasks. Obtained qualitative results suggest that a hybrid interface combining the strengths of both approaches could address the key challenges identified during the experiment, such as reducing cognitive load through graphical feedback while maintaining the intuitive nature of voice-based interactions. This work highlights the potential of conversational interfaces as a viable and effective method for knowledge management in specific business contexts.

摘要

本研究提出了一种基于多智能体大语言模型架构和语音用户界面(VUI)的解决方案,旨在更新数字助理的知识库。通过与传统图形内容管理系统(CMS)的对比实验评估其可用性,重点探究用户偏好与信息复杂度之间的关系。研究发现,虽然VUI的整体可用性评分低于图形界面,但在处理低复杂度任务时已获得用户青睐。即使面对高复杂度任务,通过VUI输入的内容质量仍与图形界面相当。定性分析结果表明,结合两种界面优势的混合方案能够有效解决实验中发现的关键问题,例如通过图形反馈降低认知负荷,同时保留语音交互的直观特性。这项工作揭示了对话式界面作为特定商业场景中知识管理方法的可行性和有效性。


Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon

Abstract

arXiv:2505.22184v1 Announce Type: cross Abstract: Social media platforms have experienced a significant rise in toxic content, including abusive language and discriminatory remarks, presenting growing challenges for content moderation. Some users evade censorship by deliberately disguising toxic words through homophonic cloak, which necessitates the task of unveiling cloaked toxicity. Existing methods are mostly designed for English texts, while Chinese cloaked toxicity unveiling has not been solved yet. To tackle the issue, we propose C²TU, a novel training-free and prompt-free method for Chinese cloaked toxic content unveiling. It first employs substring matching to identify candidate toxic words based on Chinese homo-graph and toxic lexicon. Then it filters those candidates that are non-toxic and corrects cloaks to be their corresponding toxicities. Specifically, we develop two model variants for filtering, which are based on BERT and LLMs, respectively. For LLMs, we address the auto-regressive limitation in computing word occurrence probability and utilize the full semantic contexts of a text sequence to reveal cloaked toxic words. Extensive experiments demonstrate that C²TU can achieve superior performance on two Chinese toxic datasets. In particular, our method outperforms the best competitor by up to 71% on the F1 score and 35% on accuracy, respectively.

摘要

社交媒体平台上有毒内容(包括侮辱性语言和歧视性言论)的激增,给内容审核带来了日益严峻的挑战。部分用户通过谐音伪装手段故意掩饰毒性词汇以规避审查,这催生了伪装毒性内容揭露任务。现有方法大多针对英文文本设计,而中文伪装毒性揭露问题尚未得到解决。为此,我们提出C²TU——一种无需训练和提示的新型中文伪装毒性内容揭露方法。该方法首先基于中文同音图(homo-graph)和毒性词典,通过子串匹配识别候选毒性词汇;随后过滤非毒性候选词,并将伪装形式校正为对应毒性表达。具体而言,我们开发了分别基于BERT和大语言模型(LLMs)的两种过滤变体。针对LLMs,我们解决了自回归特性在计算词汇出现概率时的局限,利用文本序列的完整语义上下文来揭示伪装毒性词汇。大量实验表明,C²TU在两个中文毒性数据集上均能实现卓越性能。特别地,我们的方法在F1分数和准确率上分别以最高71%和35%的优势超越最佳基线模型。


Pitfalls of Rule- and Model-based Verifiers -- A Case Study on Mathematical Reasoning

Abstract

arXiv:2505.22203v1 Announce Type: cross Abstract: Trustworthy verifiers are essential for the success of reinforcement learning with verifiable reward (RLVR), which is the core methodology behind various large reasoning models such as DeepSeek-R1. In complex domains like mathematical reasoning, rule-based verifiers have been widely adopted in previous works to train strong reasoning models. However, the reliability of these verifiers and their impact on the RL training process remain poorly understood. In this work, we take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats across multiple commonly used mathematical datasets, resulting in non-negligible false negative rates. This limitation adversely affects RL training performance and becomes more pronounced as the policy model gets stronger. Subsequently, we investigate model-based verifiers as a potential solution to address these limitations. While the static evaluation shows that model-based verifiers achieve significantly higher verification accuracy, further analysis and RL training results imply that they are highly susceptible to hacking, where they misclassify certain patterns in responses as correct (i.e., false positives). This vulnerability is exploited during policy model optimization, leading to artificially inflated rewards. Our findings underscore the unique risks inherent to both rule-based and model-based verifiers, aiming to offer valuable insights to develop more robust reward systems in reinforcement learning.

摘要

可信的验证器对于可验证奖励的强化学习(RLVR)的成功至关重要,这是DeepSeek-R1等各类大型推理模型背后的核心方法。在数学推理等复杂领域中,基于规则的验证器已在先前研究中被广泛用于训练强大的推理模型。然而,这些验证器的可靠性及其对强化学习训练过程的影响仍鲜为人知。本研究以数学推理为例,在静态评估和强化学习训练场景中对各类验证器进行了全面分析。首先,我们发现当前开源的基于规则的验证器在多个常用数学数据集上,往往无法识别不同格式表达的等效答案,导致不可忽视的假阴性率。这一缺陷会对强化学习训练性能产生负面影响,且随着策略模型的增强而愈发显著。随后,我们研究了基于模型的验证器作为解决这些局限的潜在方案。虽然静态评估表明基于模型的验证器能达到显著更高的验证准确率,但进一步分析和强化学习训练结果表明,它们极易受到攻击,即错误地将响应中的某些模式分类为正确(即假阳性)。这种漏洞在策略模型优化过程中被利用,导致奖励被人为夸大。我们的研究结果揭示了基于规则和基于模型的验证器各自固有的独特风险,旨在为开发更鲁棒的强化学习奖励系统提供有价值的见解。
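The false-negative problem described above can be illustrated with a toy example. This is a minimal sketch, not any of the verifiers studied in the paper: `strict_match` stands in for a naive rule-based check, and `numeric_match` is a hypothetical, slightly more tolerant rule that compares answers as exact rationals when both parse as numbers.

```python
from fractions import Fraction


def strict_match(pred: str, ref: str) -> bool:
    """Naive rule: answers must be the same string after trimming whitespace."""
    return pred.strip() == ref.strip()


def numeric_match(pred: str, ref: str) -> bool:
    """Compare as exact rationals when both parse; otherwise fall back to strict_match."""
    try:
        return Fraction(pred.strip()) == Fraction(ref.strip())
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers (or malformed fractions) cannot be normalized here.
        return strict_match(pred, ref)


# "0.5" and "1/2" are the same value, but the strict rule rejects the pair.
false_negative = not strict_match("0.5", "1/2") and numeric_match("0.5", "1/2")
```

Even the tolerant rule only covers decimals and simple fractions; equivalent symbolic forms would still be rejected, which is the kind of gap the abstract attributes to rule-based verifiers.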


Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Abstract

arXiv:2505.22232v1 Announce Type: cross Abstract: High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering methods like Fineweb2. JQL notably enhances downstream model training quality and increases data retention rates. Our research provides practical insights and valuable resources for multilingual data curation, raising the standards of multilingual dataset development.

摘要

高质量多语言训练数据对于有效预训练大语言模型(LLMs)至关重要。然而,现有合适的开源多语言数据集仍然有限。当前最先进的数据集大多依赖启发式过滤方法,这既限制了其跨语言迁移能力,也制约了可扩展性。本文提出JQL——一种系统性方法,能高效构建大规模多样化高质量多语言数据集,同时显著降低计算需求。JQL将LLMs的标注能力蒸馏至基于预训练多语言嵌入的轻量级标注器中。这些模型展现出强大的多语言和跨语言性能,即使对训练时未见的语言和文字体系亦然。通过在35种语言上的实证评估,该标注流程显著优于Fineweb2等现有启发式过滤方法。JQL显著提升了下游模型训练质量并提高了数据保留率。本研究为多语言数据构建提供了实践指导和宝贵资源,提升了多语言数据集开发的标准。


Test-Time Immunization: A Universal Defense Framework Against Jailbreaks for (Multimodal) Large Language Models

Abstract

arXiv:2505.22271v1 Announce Type: cross Abstract: While (multimodal) large language models (LLMs) have attracted widespread attention due to their exceptional capabilities, they remain vulnerable to jailbreak attacks. Various defense methods have been proposed against jailbreak attacks; however, they are often tailored to specific types of jailbreak attacks, limiting their effectiveness against diverse adversarial strategies. For instance, rephrasing-based defenses are effective against text adversarial jailbreaks but fail to counteract image-based attacks. To overcome these limitations, we propose a universal defense framework, termed Test-time IMmunization (TIM), which can adaptively defend against various jailbreak attacks in a self-evolving way. Specifically, TIM initially trains a gist token for efficient detection, which it subsequently applies to detect jailbreak activities during inference. When jailbreak attempts are identified, TIM implements safety fine-tuning using the detected jailbreak instructions paired with refusal answers. Furthermore, to mitigate potential performance degradation in the detector caused by parameter updates during safety fine-tuning, we decouple the fine-tuning process from the detection module. Extensive experiments on both LLMs and multimodal LLMs demonstrate the efficacy of TIM.

摘要

尽管(多模态)大语言模型(LLM)因其卓越能力受到广泛关注,它们仍易受越狱攻击影响。现有防御方法往往只针对特定类型的越狱攻击而设计,因而难以应对多样化的对抗策略。例如,基于重述的防御对文本对抗性越狱有效,但无法抵抗基于图像的攻击。为突破这些局限,我们提出一种通用防御框架——测试时免疫(TIM),该框架能以自进化方式自适应防御各类越狱攻击。具体而言,TIM首先训练一个用于高效检测的要点标记(gist token),随后在推理阶段利用该标记识别越狱行为。当检测到越狱尝试时,TIM会使用捕获的越狱指令与拒绝回答进行安全微调。此外,为避免安全微调过程中参数更新导致检测器性能下降,我们将微调过程与检测模块解耦。在大语言模型和多模态大语言模型上的大量实验验证了TIM的有效性。


Let's Predict Sentence by Sentence

Abstract

arXiv:2505.22202v1 Announce Type: cross Abstract: Autoregressive language models (LMs) generate one token at a time, yet human reasoning operates over higher-level abstractions: sentences, propositions, and concepts. This contrast raises a central question: can LMs likewise learn to reason over structured semantic units rather than raw token sequences? In this work, we investigate whether pretrained LMs can be lifted into such abstract reasoning spaces by building on their learned representations. We present a framework that adapts a pretrained token-level LM to operate in sentence space by autoregressively predicting continuous embeddings of next sentences. We explore two embedding paradigms inspired by classical representation learning: 1) semantic embeddings, learned via autoencoding to preserve surface meaning; and 2) contextual embeddings, trained via next-sentence prediction to encode anticipatory structure. We evaluate both under two inference regimes: Discretized, which decodes each predicted embedding into text before re-encoding; and Continuous, which reasons entirely in embedding space for improved efficiency. Across four domains (mathematics, logic, commonsense, and planning), contextual embeddings under continuous inference show competitive performance with Chain-of-Thought (CoT) while reducing inference-time FLOPs on average by half. We also present early signs of scalability and modular adaptation. Finally, to visualize latent trajectories, we introduce SentenceLens, a diagnostic tool that decodes intermediate model states into interpretable sentences. Together, our results indicate that pretrained LMs can effectively transition to abstract, structured reasoning within latent embedding spaces.

摘要

自回归语言模型(LMs)每次生成一个词元,而人类推理则基于更高层次的抽象单元——句子、命题和概念。这种差异引发了一个核心问题:语言模型能否同样学会在结构化语义单元而非原始词元序列上进行推理?本研究探讨了能否基于预训练语言模型的表征能力,将其提升至此类抽象推理空间。我们提出一个框架,通过自回归预测下一句的连续嵌入向量,将预训练的词元级语言模型适配到句子空间运作。我们探索了两种受经典表征学习启发的嵌入范式:1)语义嵌入,通过自编码学习以保留表层意义;2)上下文嵌入,通过下一句预测训练以编码预期结构。我们在两种推理机制下评估这两种范式:离散化推理(将每个预测嵌入解码为文本后重新编码)和连续推理(完全在嵌入空间中进行以提高效率)。在数学、逻辑、常识和规划四个领域中,连续推理下的上下文嵌入表现出与思维链(CoT)相当的性能,同时平均减少了一半的推理时浮点运算量。我们还展示了初步的可扩展性和模块化适配迹象。最后,为可视化潜在轨迹,我们开发了SentenceLens诊断工具,将中间模型状态解码为可解释的句子。综合结果表明,预训练语言模型能够有效过渡到潜在嵌入空间内的抽象结构化推理。


Text2Grad: Reinforcement Learning from Natural Language Feedback

Abstract

arXiv:2505.22338v1 Announce Type: cross Abstract: Traditional RLHF optimizes language models with coarse, scalar rewards that mask the fine-grained reasons behind success or failure, leading to slow and opaque learning. Recent work augments RL with textual critiques through prompting or reflection, improving interpretability but leaving model parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. Given human (or programmatic) critiques, Text2Grad aligns each feedback phrase with the relevant token spans, converts these alignments into differentiable reward signals, and performs gradient updates that directly refine the offending portions of the model's policy. This yields precise, feedback-conditioned adjustments instead of global nudges. Text2Grad is realized through three components: (1) a high-quality feedback-annotation pipeline that pairs critiques with token spans; (2) a fine-grained reward model that predicts span-level reward on answer while generating explanatory critiques; and (3) a span-level policy optimizer that back-propagates natural-language gradients. Across summarization, code generation, and question answering, Text2Grad consistently surpasses scalar-reward RL and prompt-only baselines, providing both higher task metrics and richer interpretability. Our results demonstrate that natural-language feedback, when converted to gradients, is a powerful signal for fine-grained policy optimization. The code for our method is available at https://github.com/microsoft/Text2Grad

摘要

传统RLHF方法通过粗粒度的标量奖励优化语言模型,这种奖励掩盖了成功或失败的细粒度原因,导致学习过程缓慢且不透明。近期研究通过提示或反思机制将文本批评纳入强化学习,虽提升了可解释性但未触及模型参数。我们提出Text2Grad——一种将自由形式文本反馈转化为片段级梯度的强化学习范式。该系统在接收人类(或程序化)批评后,将每个反馈短语与相关标记片段对齐,将这些对齐关系转化为可微分奖励信号,并通过梯度更新直接修正模型策略中的问题部分。该方法实现了基于反馈的精准调整,而非全局微调。Text2Grad由三个核心组件实现:(1)将批评与标记片段配对的高质量反馈标注流程;(2)在预测片段级奖励的同时生成解释性批评的细粒度奖励模型;(3)能够反向传播自然语言梯度的片段级策略优化器。在文本摘要、代码生成和问答任务中,Text2Grad始终优于标量奖励强化学习和仅使用提示的基线方法,既获得更高任务指标又提供更丰富的可解释性。实验结果表明,当自然语言反馈转化为梯度时,能成为细粒度策略优化的强效信号。方法代码已发布于https://github.com/microsoft/Text2Grad。


Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

Abstract

arXiv:2505.22334v1 Announce Type: cross Abstract: Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns--where models exhibit self-correction through reflection--are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3% → 73.4% on MathVista, 62.9% → 70.4% on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.

摘要

大型语言模型(LLMs)的最新进展展示了令人瞩目的思维链推理能力,其中强化学习(RL)在这一进步中起到了关键作用。尽管“顿悟时刻”模式——即模型通过反思实现自我纠正——常被归因于RL的涌现特性,但我们首先证明了这些模式在RL训练前的多模态LLMs(MLLMs)中就已存在,且未必与推理性能提升相关。基于这些发现,我们提出了一项通过两阶段方法增强多模态推理的综合研究:(1)以结构化思维链推理模式进行监督微调(SFT)作为冷启动,随后(2)通过GRPO进行强化学习以进一步优化这些能力。大量实验表明,这种组合方法在具有挑战性的多模态推理基准测试中持续优于仅使用SFT或仅使用RL的方法。所得到的模型在开源MLLMs中实现了3B和7B规模的最先进性能,其中7B模型相较于基础模型有显著提升(例如MathVista上66.3%→73.4%,We-Math上62.9%→70.4%),而3B模型的性能可与多个7B模型竞争。总体而言,这项工作为构建先进多模态推理模型提供了实用指导。代码发布于https://github.com/waltonfuture/RL-with-Cold-Start。


Skywork Open Reasoner 1 Technical Report

Abstract

arXiv:2505.22312v1 Announce Type: cross Abstract: The success of DeepSeek-R1 underscores the significant role of reinforcement learning (RL) in enhancing the reasoning capabilities of large language models (LLMs). In this work, we present Skywork-OR1, an effective and scalable RL implementation for long Chain-of-Thought (CoT) models. Building on the DeepSeek-R1-Distill model series, our RL approach achieves notable performance gains, increasing average accuracy across AIME24, AIME25, and LiveCodeBench from 57.8% to 72.8% (+15.0%) for the 32B model and from 43.6% to 57.5% (+13.9%) for the 7B model. Our Skywork-OR1-32B model surpasses both DeepSeek-R1 and Qwen3-32B on the AIME24 and AIME25 benchmarks, while achieving comparable results on LiveCodeBench. The Skywork-OR1-7B and Skywork-OR1-Math-7B models demonstrate competitive reasoning capabilities among models of similar size. We perform comprehensive ablation studies on the core components of our training pipeline to validate their effectiveness. Additionally, we thoroughly investigate the phenomenon of entropy collapse, identify key factors affecting entropy dynamics, and demonstrate that mitigating premature entropy collapse is critical for improved test performance. To support community research, we fully open-source our model weights, training code, and training datasets.

摘要

DeepSeek-R1的成功凸显了强化学习(RL)在增强大语言模型(LLMs)推理能力中的重要作用。本研究提出Skywork-OR1,一种针对长链思维(CoT)模型的高效且可扩展的RL实现方案。基于DeepSeek-R1-Distill模型系列,我们的RL方法实现了显著的性能提升:在AIME24、AIME25和LiveCodeBench测试集上,32B模型的平均准确率从57.8%提升至72.8%(+15.0%),7B模型从43.6%提升至57.5%(+13.9%)。Skywork-OR1-32B模型在AIME24和AIME25基准测试中超越DeepSeek-R1和Qwen3-32B,同时在LiveCodeBench上取得可比结果。Skywork-OR1-7B和Skywork-OR1-Math-7B模型在同类尺寸模型中展现出具有竞争力的推理能力。我们通过消融实验验证了训练流程核心组件的有效性,并深入研究了熵崩塌现象,识别出影响熵动态的关键因素,证明缓解过早熵崩塌对提升测试性能至关重要。为支持社区研究,我们完整开源了模型权重、训练代码及训练数据集。


Budget-Adaptive Adapter Tuning in Orthogonal Subspaces for Continual Learning in LLMs

Abstract

arXiv:2505.22358v1 Announce Type: cross Abstract: Large language models (LLMs) often suffer from catastrophic forgetting in continual learning (CL) scenarios, where performance on previously learned tasks degrades severely while training on sequentially arriving tasks. Although pioneering CL approaches using orthogonal subspaces can mitigate task interference, they typically employ fixed budget allocation, neglecting the varying complexity across tasks and layers. Besides, recent budget-adaptive tuning methods for LLMs often adopt multi-stage paradigms that decouple optimization and budget allocation. Such decoupling results in potential misalignment, which hinders those approaches' practical application in CL scenarios. To address these limitations, we propose OA-Adapter, a novel parameter-efficient approach for continual learning in LLMs that unifies dynamic budget adaptation with orthogonal subspace learning in a single end-to-end training stage. Specifically, OA-Adapter introduces a dynamic bottleneck dimension adaptation mechanism that simultaneously allocates an efficient parameter budget and optimizes task objectives without misalignment. To effectively preserve previously acquired knowledge while coordinating with the dynamic budget allocation, orthogonal constraints are applied specifically between the parameter subspace of the current task and the dynamically allocated parameter subspaces of historical tasks. Experimental results on continual learning benchmarks demonstrate that OA-Adapter outperforms state-of-the-art methods in both accuracy and parameter efficiency, achieving higher average accuracy while using 58.5% fewer parameters on the standard CL benchmark.

摘要

大语言模型(LLMs)在持续学习(CL)场景中常遭受灾难性遗忘问题,即在顺序学习新任务时,对已学习任务的性能会急剧下降。尽管现有采用正交子空间的持续学习方法能缓解任务间干扰,但通常采用固定预算分配策略,忽视了不同任务和网络层间的复杂度差异。此外,当前面向LLMs的预算自适应调优方法多采用多阶段范式,将优化过程与预算分配解耦,这种脱节可能导致潜在偏差,阻碍其在持续学习场景的实际应用。针对这些局限性,我们提出OA-Adapter——一种新颖的参数高效持续学习方法,通过端到端单阶段训练将动态预算适配与正交子空间学习相统一。具体而言,OA-Adapter引入动态瓶颈维度适配机制,在避免偏差的同时实现高效参数预算分配与任务目标优化。为有效保留历史知识并与动态预算分配协同工作,该方法专门在当前任务参数子空间与历史任务动态分配参数子空间之间施加正交约束。持续学习基准测试表明,OA-Adapter在准确率和参数效率上均优于现有最优方法,在标准CL基准上以58.5%更少的参数量实现了更高的平均准确率。


Scaling Reasoning without Attention

Abstract

arXiv:2505.22425v1 Announce Type: cross Abstract: Large language models (LLMs) have made significant advances in complex reasoning tasks, yet they remain bottlenecked by two core challenges: architectural inefficiency due to reliance on Transformers, and a lack of structured fine-tuning for high-difficulty domains. We introduce \ourmodel, an attention-free language model that addresses both issues through architectural and data-centric innovations. Built on the state space dual (SSD) layers of Mamba-2, our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. To train it for complex reasoning, we propose a two-phase curriculum fine-tuning strategy based on the PromptCoT synthesis paradigm, which generates pedagogically structured problems via abstract concept selection and rationale-guided generation. On benchmark evaluations, \ourmodel-7B outperforms strong Transformer and hybrid models of comparable scale, and even surpasses the much larger Gemma3-27B by 2.6% on AIME 24, 0.6% on AIME 25, and 3.0% on Livecodebench. These results highlight the potential of state space models as efficient and scalable alternatives to attention-based architectures for high-capacity reasoning.

摘要

尽管大语言模型(LLMs)在复杂推理任务上取得了显著进展,但仍受限于两大核心挑战:因依赖Transformer架构导致的效率瓶颈,以及缺乏针对高难度领域的结构化微调方法。我们提出\ourmodel——一种无需注意力机制的语言模型,通过架构创新与数据中心的改进同时解决了上述问题。该模型基于Mamba-2的状态空间对偶(SSD)层构建,无需自注意力机制和键值缓存,实现了固定内存消耗的恒定时间推理。为训练其复杂推理能力,我们基于PromptCoT合成范式提出两阶段课程微调策略:通过抽象概念选择与原理引导生成,构建具有教学结构的问题集。基准测试表明,\ourmodel-7B在同等规模下优于强Transformer及混合模型,并在AIME 24、AIME 25和Livecodebench上分别以2.6%、0.6%和3.0%的优势超越规模大得多的Gemma3-27B。这些结果证明了状态空间模型作为高效、可扩展的注意力架构替代方案,在高容量推理任务中的潜力。


Mitigating Overthinking in Large Reasoning Models via Manifold Steering

Abstract

arXiv:2505.22411v1 Announce Type: cross Abstract: Recent advances in Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex tasks such as mathematics and coding. However, these models frequently exhibit a phenomenon known as overthinking during inference, characterized by excessive validation loops and redundant deliberation, leading to substantial computational overheads. In this paper, we aim to mitigate overthinking by investigating the underlying mechanisms from the perspective of mechanistic interpretability. We first showcase that the tendency of overthinking can be effectively captured by a single direction in the model's activation space and the issue can be eased by intervening on the activations along this direction. However, this efficacy soon reaches a plateau and even deteriorates as the intervention strength increases. We therefore systematically explore the activation space and find that the overthinking phenomenon is actually tied to a low-dimensional manifold, which indicates that the limited effect stems from the noise introduced by the high-dimensional steering direction. Based on this insight, we propose Manifold Steering, a novel approach that elegantly projects the steering direction onto the low-dimensional activation manifold given the theoretical approximation of the interference noise. Extensive experiments on DeepSeek-R1 distilled models validate that our method reduces output tokens by up to 71% while maintaining and even improving the accuracy on several mathematical benchmarks. Our method also exhibits robust cross-domain transferability, delivering consistent token reduction performance in code generation and knowledge-based QA tasks. Code is available at: https://github.com/Aries-iai/Manifold_Steering.

摘要

大规模推理模型(LRMs)的最新进展在解决数学和编程等复杂任务方面展现出卓越能力。然而这些模型在推理过程中频繁出现"过度思考"现象,表现为过度的验证循环和冗余推演,导致显著的计算开销。本文从机制可解释性角度研究其内在机理以缓解该问题。我们首先证明:模型激活空间中的单一方向可有效捕捉过度思考倾向,通过沿该方向干预激活能缓解问题。但随着干预强度增加,效果很快达到平台期甚至恶化。通过系统探索激活空间,发现过度思考现象实际与一个低维流形相关,表明效果受限源于高维引导方向引入的噪声干扰。基于此,我们提出流形引导法——在理论近似干扰噪声前提下,将引导方向优雅地投影至低维激活流形。在DeepSeek-R1蒸馏模型上的大量实验表明,该方法在保持甚至提升多个数学基准准确率的同时,最高可减少71%的输出标记。该方法还展现出强大的跨领域迁移能力,在代码生成和知识问答任务中均保持稳定的标记缩减性能。代码见:https://github.com/Aries-iai/Manifold_Steering。


Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

Abstract

arXiv:2505.22453v1 Announce Type: cross Abstract: Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data--an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3% → 72.9% on MathVista, 62.9% → 68.7% on We-Math), using a standard dataset without ground truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by the MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.

摘要

改进多模态大语言模型(MLLMs)的后训练阶段通常依赖于监督微调(SFT)或强化学习(RL)。然而,这些监督方法需要昂贵且人工标注的多模态数据——这种资源最终难以持续获取。尽管近期研究探索了无监督后训练方法,但其方案复杂且难以迭代。本研究首次探索使用GRPO(一种稳定且可扩展的在线RL算法)实现无需外部监督的持续自我改进。我们提出MM-UPT,这是一个简单而有效的无监督MLLMs后训练框架。MM-UPT基于GRPO构建,用基于多响应样本多数投票的自奖励机制替代传统奖励信号。实验表明,MM-UPT显著提升了Qwen2.5-VL-7B的推理能力(例如MathVista数据集从66.3%提升至72.9%,We-Math数据集从62.9%提升至68.7%),且仅使用无真实标签的标准数据集。MM-UPT不仅优于先前无监督基线,甚至接近监督GRPO的结果。此外,我们发现仅通过MLLM自身生成的合成问题也能提升性能,这为可扩展的自我改进提供了新思路。总体而言,MM-UPT为无外部监督环境下MLLMs的持续自主增强提供了新范式。代码已开源:https://github.com/waltonfuture/MM-UPT。
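The self-rewarding step described above can be illustrated with a minimal sketch. It assumes that each sampled response has already been reduced to a final answer string, that ties are broken by first occurrence, and that rewards are standardized within the sampled group in the usual GRPO style. The function names are hypothetical and are not taken from the MM-UPT code base.

```python
from collections import Counter
from statistics import mean, pstdev


def majority_vote_rewards(answers):
    """Reward 1.0 for answers equal to the most common answer, else 0.0.

    Assumption: ties are broken by first occurrence (Counter ordering).
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to the same unlabeled question (made-up values).
answers = ["12", "12", "15", "12"]
rewards = majority_vote_rewards(answers)
advantages = group_advantages(rewards)
```

No ground-truth label appears anywhere in the sketch: the pseudo-label is simply the answer that the sampled group agrees on most often.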


Fostering Video Reasoning via Next-Event Prediction

Abstract

arXiv:2505.22457v1 Announce Type: cross Abstract: Next-token prediction serves as the foundational learning task enabling reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, thereby encouraging the model to reason temporally in order to complete the task. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. We further explore a range of video instruction-tuning strategies to study their effects on temporal reasoning. To evaluate progress, we introduce FutureBench to assess coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.

摘要

下一词元预测(next-token prediction)作为基础学习任务,使大型语言模型具备推理能力。但当目标是为多模态大语言模型赋予视频输入的时间推理能力时,应采用何种学习任务?现有任务如视频问答通常依赖人类或更强大多模态大模型的标注,而视频描述则往往将时间推理与空间信息混为一谈。为填补这一空白,我们提出下一事件预测(NEP),该学习任务利用未来视频片段作为丰富的自监督信号来培养时间推理能力。我们将每个视频分割为过去帧和未来帧:多模态大模型以过去帧作为输入,预测从未来帧提取的事件摘要,从而促使模型通过时间推理完成任务。为支持该任务,我们构建了V1-33K数据集,包含33,000个自动提取的涵盖多样化现实场景的视频片段。我们进一步探索多种视频指令微调策略以研究其对时间推理的影响。为评估进展,我们提出FutureBench来度量预测未见未来事件的连贯性。实验验证表明,下一事件预测为培养多模态大模型的时间推理能力提供了可扩展且有效的训练范式。
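The past/future split behind next-event prediction can be sketched as follows. This is only an illustration under stated assumptions: frames are represented as a plain list, the split point is a fixed fraction (`past_fraction` is a hypothetical parameter), and the summary of the future segment is assumed to come from some external captioning step that is not shown.

```python
def split_video(frames, past_fraction=0.5):
    """Split a frame sequence into (past, future) parts.

    Assumption: a fixed-fraction cut; the source does not say how the
    boundary between past and future frames is chosen.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames to split")
    if not 0.0 < past_fraction < 1.0:
        raise ValueError("past_fraction must be strictly between 0 and 1")
    cut = int(len(frames) * past_fraction)
    cut = max(1, min(len(frames) - 1, cut))  # keep both parts non-empty
    return frames[:cut], frames[cut:]


def make_training_pair(frames, future_summary, past_fraction=0.5):
    """Build one self-supervised example: past frames in, future summary out."""
    past, _future = split_video(frames, past_fraction)
    return {"input_frames": past, "target_text": future_summary}


frames = list(range(10))  # stand-ins for decoded video frames
past, future = split_video(frames)
pair = make_training_pair(frames, "hypothetical summary of the future segment")
```

The model never sees the future frames at training time; they only determine the target text.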


From Strangers to Assistants: Fast Desire Alignment for Embodied Agent-User Adaptation

Abstract

arXiv:2505.22503v1 Announce Type: cross Abstract: While embodied agents have made significant progress in performing complex physical tasks, real-world applications demand more than pure task execution. The agents must collaborate with unfamiliar agents and human users, whose goals are often vague and implicit. In such settings, interpreting ambiguous instructions and uncovering underlying desires is essential for effective assistance. Therefore, fast and accurate desire alignment becomes a critical capability for embodied agents. In this work, we first develop a home assistance simulation environment HA-Desire that integrates an LLM-driven human user agent exhibiting realistic value-driven goal selection and communication. The ego agent must interact with this proxy user to infer and adapt to the user's latent desires. To achieve this, we present a novel framework FAMER for fast desire alignment, which introduces a desire-based mental reasoning mechanism to identify user intent and filter desire-irrelevant actions. We further design a reflection-based communication module that reduces redundant inquiries, and incorporate goal-relevant information extraction with memory persistence to improve information reuse and reduce unnecessary exploration. Extensive experiments demonstrate that our framework significantly enhances both task execution and communication efficiency, enabling embodied agents to quickly adapt to user-specific desires in complex embodied environments.

摘要

尽管具身智能体在执行复杂物理任务方面已取得显著进展,但现实应用需求远超越单纯的任务执行。智能体必须与陌生智能体及人类用户协作,而用户目标往往模糊且隐含。在此类场景中,解析模糊指令并揭示潜在需求是提供有效协助的关键。因此,快速精准的需求对齐成为具身智能体的核心能力。本研究首先构建了家庭辅助仿真环境HA-Desire,其整合了由大语言模型驱动的人类用户代理,能呈现基于真实价值观的目标选择与沟通行为。主控智能体需通过与代理用户交互来推断并适应用户的潜在需求。为此,我们提出新型快速需求对齐框架FAMER,引入基于需求的心理推理机制以识别用户意图,并过滤与需求无关的行为。进一步设计基于反思的通信模块以减少冗余询问,同时结合目标相关信息提取与记忆持久化机制,提升信息复用率并降低无效探索。大量实验表明,该框架显著提升了任务执行与通信效率,使具身智能体能在复杂环境中快速适应用户特定需求。


Thinking with Generated Images

Abstract

arXiv:2505.22525v1 Announce Type: cross Abstract: We present Thinking with Generated Images, a novel paradigm that fundamentally transforms how large multimodal models (LMMs) engage with visual reasoning by enabling them to natively think across text and vision modalities through spontaneous generation of intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability where models can actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into manageable components that are generated and integrated progressively, and (2) vision generation with self-critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce refined outputs based on their own critiques. Our experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving up to 50% (from 38% to 57%) relative improvement in handling complex multi-object scenarios. From biochemists exploring novel protein structures, and architects iterating on spatial designs, to forensic analysts reconstructing crime scenes, and basketball players envisioning strategic plays, our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking. We release our open-source suite at https://github.com/GAIR-NLP/thinking-with-generated-images.

摘要

我们提出"生成式图像思维"这一新范式,它通过让大型多模态模型(LMMs)自发生成中间视觉思维步骤,从根本上改变了模型在视觉推理过程中处理文本与视觉模态的方式。当前LMMs的视觉推理要么局限于处理用户提供的固定图像,要么仅通过基于文本的思维链(CoT)进行推理。生成式图像思维解锁了新的认知维度,使模型能够主动构建中间视觉思维、批判自身的视觉假设,并将这些过程作为推理的有机组成部分。我们通过两种互补机制验证方法的有效性:(1)具有中间视觉子目标的图像生成,模型将复杂视觉任务分解为可管理的组件进行渐进式生成与整合;(2)具有自我批判的图像生成,模型首先生成初始视觉假设,通过文本推理分析其缺陷,继而基于自我批判生成优化结果。在视觉生成基准测试中,我们的方法相较基线模型取得显著提升,处理复杂多对象场景时相对改进幅度最高达50%(从38%提升至57%)。从探索新型蛋白质结构的生物化学家、迭代空间设计的建筑师,到重建犯罪现场的刑侦分析师、构想战术配合的篮球运动员,我们的方法使AI模型能够进行类人的视觉想象与迭代优化,这种能力正是人类创造性、分析性与策略性思维的特征。我们已将开源套件发布于 https://github.com/GAIR-NLP/thinking-with-generated-images。
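The "50% relative improvement" figure can be checked directly from the two scores quoted in the abstract. The short sketch below only reproduces that arithmetic; the helper name is hypothetical.

```python
def relative_improvement(before, after):
    """Relative gain of `after` over `before`, as a fraction of `before`."""
    return (after - before) / before


baseline_score = 38.0  # percent, as quoted in the abstract
improved_score = 57.0  # percent, as quoted in the abstract

absolute_gain = improved_score - baseline_score          # 19 percentage points
relative_gain = relative_improvement(baseline_score, improved_score)  # 0.5
```

So the gain is 19 percentage points in absolute terms, which is 50% of the 38% baseline.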


Agent-UniRAG: A Trainable Open-Source LLM Agent Framework for Unified Retrieval-Augmented Generation Systems

Abstract

arXiv:2505.22571v1 Announce Type: cross Abstract: This paper presents a novel approach for unified retrieval-augmented generation (RAG) systems using the recently emerged large language model (LLM) agent concept. Specifically, Agent LLM, which utilizes an LLM as the fundamental controller, has become a promising approach to enable the interpretability of RAG tasks, especially for complex reasoning question-answering systems (e.g., multi-hop queries). Nonetheless, previous works mainly focus on solving RAG systems with either single-hop or multi-hop approaches separately, which limits the application of those approaches to real-world applications. In this study, we propose a trainable agent framework called Agent-UniRAG for unified retrieval-augmented LLM systems, which enhances the effectiveness and interpretability of RAG systems. The main idea is to design an LLM agent framework that solves RAG tasks step-by-step based on the complexity of the inputs, handling both single-hop and multi-hop queries in an end-to-end manner. Furthermore, we introduce SynAgent-RAG, a synthetic dataset to enable the proposed agent framework for small open-source LLMs (e.g., Llama-3-8B). The results show performance comparable to closed-source and larger open-source LLMs across various RAG benchmarks. Our source code and dataset are publicly available for further exploitation.

摘要

本文提出了一种基于新兴大语言模型(LLM)智能体概念的统一检索增强生成(RAG)系统新方法。具体而言,利用LLM作为基础控制器的智能体LLM,已成为实现RAG任务可解释性的有效途径,尤其适用于复杂推理问答系统(如多跳查询)。然而,现有研究主要集中于分别解决单跳或多跳RAG系统,这限制了这些方法在实际应用中的适用性。本研究提出名为Agent-UniRAG的可训练智能体框架,用于统一检索增强的LLM系统,旨在提升RAG系统的效能与可解释性。其核心思想是设计一个LLM智能体框架,根据输入复杂度逐步解决RAG任务,以端到端方式同时处理单跳和多跳查询。此外,我们引入了SynAgent-RAG合成数据集,使该框架能适配小型开源LLM(如Llama-3-8B)。实验结果表明,该方法在各种RAG基准测试中与闭源及更大规模开源LLM具有可比性能。我们的源代码和数据集已公开以供进一步研究。


ClaimPKG: Enhancing Claim Verification via Pseudo-Subgraph Generation with Lightweight Specialized LLM

Abstract

arXiv:2505.22552v1 Announce Type: cross Abstract: Integrating knowledge graphs (KGs) to enhance the reasoning capabilities of large language models (LLMs) is an emerging research challenge in claim verification. While KGs provide structured, semantically rich representations well-suited for reasoning, most existing verification methods rely on unstructured text corpora, limiting their ability to effectively leverage KGs. Additionally, despite possessing strong reasoning abilities, modern LLMs struggle with multi-step modular pipelines and reasoning over KGs without adaptation. To address these challenges, we propose ClaimPKG, an end-to-end framework that seamlessly integrates LLM reasoning with structured knowledge from KGs. Specifically, the main idea of ClaimPKG is to employ a lightweight, specialized LLM to represent the input claim as pseudo-subgraphs, guiding a dedicated subgraph retrieval module to identify relevant KG subgraphs. These retrieved subgraphs are then processed by a general-purpose LLM to produce the final verdict and justification. Extensive experiments on the FactKG dataset demonstrate that ClaimPKG achieves state-of-the-art performance, outperforming strong baselines in this research field by 9%-12% accuracy points across multiple categories. Furthermore, ClaimPKG exhibits zero-shot generalizability to unstructured datasets such as HoVer and FEVEROUS, effectively combining structured knowledge from KGs with LLM reasoning across various LLM backbones.

摘要

如何整合知识图谱(KGs)以增强大语言模型(LLMs)的推理能力,是声明验证领域新兴的研究挑战。尽管知识图谱提供了适合推理的结构化、语义丰富的表示形式,但现有验证方法大多依赖非结构化文本语料库,限制了其有效利用知识图谱的能力。此外,尽管现代大语言模型具备强大的推理能力,但在未经适配的情况下,仍难以处理多步骤模块化流程及基于知识图谱的推理。为解决这些挑战,我们提出了ClaimPKG框架,该端到端系统将大语言模型推理与知识图谱的结构化知识无缝整合。具体而言,ClaimPKG的核心思想是采用轻量级专用大语言模型将输入声明表示为伪子图,引导专用子图检索模块识别相关知识图谱子图。这些检索到的子图随后由通用大语言模型处理,生成最终判定结果及论证依据。在FactKG数据集上的大量实验表明,ClaimPKG实现了最先进的性能,在多个类别中比该研究领域的强基线模型准确率高出9%-12%。此外,ClaimPKG对HoVer和FEVEROUS等非结构化数据集展现出零样本泛化能力,能有效结合不同大语言模型架构下知识图谱的结构化知识与模型推理能力。


Universal Visuo-Tactile Video Understanding for Embodied Interaction

Abstract

arXiv:2505.22566v1 Announce Type: cross Abstract: Tactile perception is essential for embodied agents to understand physical attributes of objects that cannot be determined through visual inspection alone. While existing approaches have made progress in visual and language modalities for physical understanding, they fail to effectively incorporate tactile information that provides crucial haptic feedback for real-world interaction. In this paper, we present VTV-LLM, the first multi-modal large language model for universal Visuo-Tactile Video (VTV) understanding that bridges the gap between tactile perception and natural language. To address the challenges of cross-sensor and cross-modal integration, we contribute VTV150K, a comprehensive dataset comprising 150,000 video frames from 100 diverse objects captured across three different tactile sensors (GelSight Mini, DIGIT, and Tac3D), annotated with four fundamental tactile attributes (hardness, protrusion, elasticity, and friction). We develop a novel three-stage training paradigm that includes VTV enhancement for robust visuo-tactile representation, VTV-text alignment for cross-modal correspondence, and text prompt finetuning for natural language generation. Our framework enables sophisticated tactile reasoning capabilities including feature assessment, comparative analysis, scenario-based decision making and so on. Experimental evaluations demonstrate that VTV-LLM achieves superior performance in tactile video understanding tasks, establishing a foundation for more intuitive human-machine interaction in tactile domains.

摘要

触觉感知对于具身智能体理解无法仅通过视觉检查确定的物体物理属性至关重要。尽管现有方法在视觉与语言模态的物理理解方面取得了进展,但未能有效整合为真实世界交互提供关键触觉反馈的触觉信息。本文提出VTV-LLM——首个用于通用视觉-触觉视频(VTV)理解的多模态大语言模型,该模型弥合了触觉感知与自然语言之间的鸿沟。针对跨传感器与跨模态融合的挑战,我们贡献了VTV150K数据集,包含来自100种不同物体的15万帧视频数据,通过三种触觉传感器(GelSight Mini、DIGIT和Tac3D)采集,并标注硬度、凸起度、弹性和摩擦力四项基础触觉属性。我们开发了新颖的三阶段训练范式:通过VTV增强实现鲁棒的视觉-触觉表征学习,通过VTV-文本对齐建立跨模态对应关系,以及通过文本提示微调优化自然语言生成。该框架支持包括特征评估、对比分析、场景决策等复杂触觉推理能力。实验评估表明,VTV-LLM在触觉视频理解任务中表现优异,为触觉领域更直观的人机交互奠定了基础。


Self-Error-Instruct: Generalizing from Errors for LLMs Mathematical Reasoning

Abstract

arXiv:2505.22591v1 Announce Type: cross Abstract: Although large language models demonstrate strong performance across various domains, they still struggle with numerous bad cases in mathematical reasoning. Previous approaches to learning from errors synthesize training data by solely extrapolating from isolated bad cases, thereby failing to generalize the extensive patterns inherent within these cases. This paper presents Self-Error-Instruct (SEI), a framework that addresses these model weaknesses and synthesizes more generalized targeted training data. Specifically, we explore a target model on two mathematical datasets, GSM8K and MATH, to pinpoint bad cases. Then, we generate error keyphrases for these cases based on the instructor model's (GPT-4o) analysis and identify error types by clustering these keyphrases. Next, we sample a few bad cases during each generation for each identified error type and input them into the instructor model, which synthesizes additional training data using a self-instruct approach. This new data is refined through a one-shot learning process to ensure that only the most effective examples are kept. Finally, we use these curated data to fine-tune the target model, iteratively repeating the process to enhance performance. We apply our framework to various models and observe improvements in their reasoning abilities across both in-domain and out-of-domain mathematics datasets. These results demonstrate the effectiveness of self-error instruction in improving LLMs' mathematical reasoning through error generalization.

摘要

尽管大语言模型在多个领域展现出强大的性能,其在数学推理方面仍存在大量错误案例。先前从错误中学习的方法仅通过孤立错误案例的外推来合成训练数据,未能泛化这些案例中蕴含的广泛模式。本文提出自错误指导框架(SEI),该框架通过识别模型弱点并合成更具泛化性的定向训练数据来解决问题。具体而言,我们在GSM8K和MATH两个数学数据集上对目标模型进行探索以定位错误案例,随后基于指导模型(GPT-4o)的分析生成这些案例的错误关键词,并通过聚类这些关键词识别错误类型。接着,我们在每次生成过程中为每个已识别的错误类型采样少量错误案例输入指导模型,该模型采用自指导方法合成额外训练数据。这些新数据通过单样本学习过程进行精炼,仅保留最有效的示例。最后,我们使用这些精选数据对目标模型进行微调,并迭代重复该过程以提升性能。我们将该框架应用于多个模型,观察到其在领域内和跨领域数学数据集上推理能力的提升。这些结果证明了通过错误泛化进行自错误指导对提升大语言模型数学推理能力的有效性。


Spatial Knowledge Graph-Guided Multimodal Synthesis

Abstract

arXiv:2505.22633v1 Announce Type: cross Abstract: Recent advances in multimodal large language models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. In this work, we introduce SKG2Data, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2Data automatically constructs a Spatial Knowledge Graph (SKG) to emulate human-like perception of spatial directions and distances, which is subsequently utilized to guide multimodal data synthesis. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, not only enhance the spatial perception and reasoning abilities of MLLMs but also exhibit strong generalization capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence.

摘要

尽管多模态大语言模型(MLLMs)近期取得了显著进展,但其空间感知能力仍存在明显局限。为解决这一挑战,多模态数据合成提供了一种可行方案。然而,确保合成数据符合空间常识是一项重要挑战。本研究提出SKG2Data——一种基于空间知识图谱引导的新型多模态合成方法,其核心思想是通过知识生成数据。该方法通过自动构建空间知识图谱(SKG)来模拟人类对空间方向与距离的认知,并以此指导多模态数据合成。大量实验表明,基于方向、距离等多样化空间知识合成的数据,不仅能有效提升MLLMs的空间感知与推理能力,还展现出强大的泛化性能。我们期望这种基于知识的数据合成思路能推动空间智能的发展。


RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

Abstract

arXiv:2505.22613v1 Announce Type: cross Abstract: Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using DPO. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap. Code released at https://github.com/wangyuchi369/RICO.

摘要

图像重描述技术被广泛用于为多模态任务生成质量更高的训练数据集。现有重描述方法通常依赖强大的多模态大语言模型(MLLM)来增强文本描述,但常因幻觉现象和缺失细粒度细节而导致描述不准确或不完整。为解决这些局限,我们提出RICO框架——通过视觉重构优化描述的创新方法。具体而言,我们利用文本生成图像模型将描述重构成参考图像,并提示MLLM识别原始图像与重构图像间的差异以优化描述。该过程通过迭代执行,逐步生成更忠实且全面的描述。为降低迭代过程带来的额外计算成本,我们进一步提出RICO-Flash,其通过DPO学习生成与RICO质量相当的描述。大量实验表明,本方法显著提升了描述的准确性和完整性,在CapsBench和CompreCap基准上均以约10%的优势超越多数基线模型。代码已发布于https://github.com/wangyuchi369/RICO。


Fusion Steering: Prompt-Specific Activation Control

Abstract

arXiv:2505.22572v1 Announce Type: cross Abstract: We present Fusion Steering, an activation steering methodology that improves factual accuracy in large language models (LLMs) for question-answering (QA) tasks. This approach introduces flexible steering configurations, including full-layer steering and segmented steering. Unlike traditional methods constrained to single-layer or fixed-layer operations, Fusion Steering employs dynamic injection of prompt-specific activation deltas across all transformer layers. These activation deltas are derived from reference completions that combine the ground-truth answer with a model-generated explanation to facilitate semantically enriched, example-specific steering. The injection weights are optimized per prompt using Optuna, targeting a joint objective that balances token overlap (factual alignment) and perplexity (fluency proxy). Evaluation employs a composite score integrating token overlap and LLM-graded quality, encompassing factual accuracy, coherence, and relevance. Empirical results on 260 SimpleQA prompts (selected from 500 where the baseline failed) showcase the efficacy of segmented steering. Using Gemma-2-2B-IT with 8-bit quantization, segmented steering achieves an accuracy of 25.4% (outputs scoring ≥ 0.6), outperforming the baseline at 3.5% and full-layer steering at 16.2%. Under the stricter SimpleQA rubric, segmented steering boosts fully correct responses from 0.0% to 13.1%. These findings highlight the strengths of segmented, dynamic intervention strategies and the promise of per-prompt, full-network activation control. Fusion Steering is also amenable to sparse representations, such as Neuronpedia or sparse crosscoders, suggesting a promising direction for interpretable and scalable activation-level control in LLMs.

摘要

我们提出"融合导向"(Fusion Steering)这一激活导向方法,旨在提升大语言模型(LLMs)在问答任务中的事实准确性。该方法引入灵活的导向配置,包括全层导向和分段导向。与传统局限于单层或固定层操作的方法不同,融合导向通过动态注入跨所有Transformer层的提示特异性激活增量来实现优化。这些激活增量源自参考补全结果——将真实答案与模型生成解释相结合,以促进语义增强的、示例特定的导向。注入权重通过Optuna针对每个提示进行优化,以实现平衡词元重叠(事实对齐)和困惑度(流畅度代理)的联合目标。评估采用综合评分体系,整合词元重叠和LLM评定的质量指标(包括事实准确性、连贯性和相关性)。在260个SimpleQA提示(选自基线失败的500个案例)上的实证结果表明:采用8位量化的Gemma-2-2B-IT模型时,分段导向达到25.4%的准确率(输出评分≥0.6),显著优于基线方法的3.5%和全层导向的16.2%。在更严格的SimpleQA标准下,分段导向将完全正确答案率从0.0%提升至13.1%。这些发现凸显了分段式动态干预策略的优势,以及基于提示的全网络激活控制的应用前景。融合导向还可兼容稀疏表示(如Neuronpedia或稀疏交叉编码器),为LLMs中可解释且可扩展的激活级控制指明了新方向。
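The idea of injecting per-layer activation deltas with different weights can be illustrated roughly as below. This is a sketch under several assumptions that the abstract does not confirm: activations are modeled as one small vector per layer, a delta is the difference between reference-completion and baseline activations, and "segmented" steering is read as giving each contiguous block of layers its own weight. All names are hypothetical, and the per-prompt search over weights (done with Optuna in the paper) is not shown.

```python
def activation_deltas(reference, baseline):
    """Per-layer difference between reference and baseline activations."""
    return [
        [r - b for r, b in zip(ref_layer, base_layer)]
        for ref_layer, base_layer in zip(reference, baseline)
    ]


def segment_weights(num_layers, segment_values):
    """Expand one weight per contiguous layer segment to one per layer.

    Assumption: segments are equal-sized contiguous blocks of layers.
    """
    seg_len = -(-num_layers // len(segment_values))  # ceiling division
    last = len(segment_values) - 1
    return [segment_values[min(i // seg_len, last)] for i in range(num_layers)]


def steer(hidden, deltas, weights):
    """Add the weighted delta to every layer's activation vector."""
    return [
        [h + w * d for h, d in zip(h_layer, d_layer)]
        for h_layer, d_layer, w in zip(hidden, deltas, weights)
    ]


# Toy 4-layer model with 2-dimensional activations (made-up numbers).
baseline = [[1.0, 1.0] for _ in range(4)]
reference = [[2.0, 3.0] for _ in range(4)]
deltas = activation_deltas(reference, baseline)

full_layer = steer(baseline, deltas, [1.0] * 4)                     # same weight everywhere
segmented = steer(baseline, deltas, segment_weights(4, [0.0, 1.0]))  # steer later half only
```

In a setup like this, the segment weights would be the quantities tuned per prompt against the joint overlap and perplexity objective.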


Learning Composable Chains-of-Thought

Abstract

arXiv:2505.22635v1 Announce Type: cross Abstract: A common approach for teaching large language models (LLMs) to reason is to train on chain-of-thought (CoT) traces of in-distribution reasoning problems, but such annotated data is costly to obtain for every problem of interest. We want reasoning models to generalize beyond their training distribution, and ideally to generalize compositionally: combine atomic reasoning skills to solve harder, unseen reasoning tasks. We take a step towards compositional generalization of reasoning skills when addressing a target compositional task that has no labeled CoT data. We find that simply training models on CoT data of atomic tasks leads to limited generalization, but minimally modifying CoT formats of constituent atomic tasks to be composable can lead to improvements. We can train "atomic CoT" models on the atomic tasks with Composable CoT data and combine them with multitask learning or model merging for better zero-shot performance on the target compositional task. Such a combined model can be further bootstrapped on a small amount of compositional data using rejection sampling fine-tuning (RFT). Results on string operations and natural language skill compositions show that training LLMs on Composable CoT outperforms multitask learning and continued fine-tuning baselines within a given training data budget.

摘要

当前教导大型语言模型(LLM)进行推理的常见方法是在分布内推理问题的思维链(CoT)标注数据上进行训练,但此类标注数据对于每个目标问题的获取成本高昂。我们希望推理模型能够超越其训练分布进行泛化,理想情况下能实现组合式泛化:通过结合原子推理技能来解决更复杂、未见过的推理任务。本文针对无标注CoT数据的目标组合任务,在推理技能的组合泛化方面迈出了探索步伐。研究发现,仅在原子任务的CoT数据上训练模型会导致泛化能力有限,但通过对构成原子任务的CoT格式进行最小化修改使其具备可组合性,即可带来性能提升。我们可以在原子任务上使用"可组合CoT"数据训练原子模型,并通过多任务学习或模型融合技术提升目标组合任务的零样本性能。此类组合模型还可利用拒绝采样微调(RFT)在少量组合数据上进行自举优化。在字符串操作和自然语言技能组合任务上的实验表明,在给定训练数据预算下,采用可组合CoT训练的LLM模型表现优于多任务学习和持续微调基线方法。


The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Abstract

arXiv:2505.22617v1 Announce Type: cross Abstract: This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. This phenomenon is consistently observed across a vast number of RL runs without entropy intervention: the policy entropy drops sharply in the early training stage, and this diminished exploratory ability is always accompanied by the saturation of policy performance. In practice, we establish a transformation equation R = -a*e^H + b between entropy H and downstream performance R. This empirical law strongly indicates that policy performance is obtained by trading away policy entropy and is thus bottlenecked by its exhaustion, and that the ceiling is fully predictable (H = 0, R = -a + b). Our finding necessitates entropy management for continuous exploration toward scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that the change in policy entropy is driven by the covariance between action probability and the change in logits, which is proportional to its advantage when using Policy Gradient-like algorithms. Empirical study shows that the values of the covariance term and the entropy differences match exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy would decrease monotonically. Through understanding the mechanism behind entropy dynamics, we are motivated to control entropy by restricting the update of high-covariance tokens. Specifically, we propose two simple yet effective techniques, namely Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariances. Experiments show that these methods encourage exploration, thus helping the policy escape entropy collapse and achieve better downstream performance.

摘要

本文旨在解决大型语言模型(LLM)强化学习规模化应用中的核心障碍,即策略熵坍塌现象。该现象在大量未经熵干预的强化学习实验中持续出现:策略熵在训练初期急剧下降,这种探索能力衰减总是伴随着策略性能的停滞。实践中,我们建立了熵值H与下游性能R的转换方程R=-a*e^H+b。该经验法则强有力地表明:策略性能是以策略熵为代价获取的,因此受限于熵的耗尽,其理论上限完全可预测(H=0时R=-a+b)。这一发现表明必须通过熵管理来实现持续探索,以支持强化学习的算力扩展。为此,我们从理论与实证两方面研究了熵动力学机制。理论推导揭示:策略熵的变化由动作概率与对数几率变化量的协方差驱动,当使用类策略梯度算法时,该协方差与优势函数成正比。实证研究表明协方差项与熵差值完全吻合,验证了理论结论。此外,协方差项在训练过程中大多保持正值,进一步解释了策略熵单调下降的原因。基于对熵动力学机制的理解,我们提出通过限制高协方差标记的更新来控制熵值。具体而言,我们开发了两种简单有效的方法:Clip-Cov和KL-Cov,分别对高协方差标记进行截断处理和应用KL惩罚。实验证明这些方法能有效促进探索,帮助策略规避熵坍塌并提升下游任务表现。
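The empirical law R = -a*e^H + b and its predicted ceiling can be made concrete with a few lines of code. The coefficients below are made-up illustrative values, not fitted values from the paper; in practice a and b would be fitted to a particular model and task.

```python
import math


def predicted_performance(entropy, a, b):
    """Empirical law from the abstract: R = -a * e^H + b."""
    return -a * math.exp(entropy) + b


# Hypothetical coefficients, chosen only for illustration.
a, b = 0.2, 0.8

early = predicted_performance(0.5, a, b)    # high entropy, early in training
late = predicted_performance(0.1, a, b)     # entropy mostly consumed
ceiling = predicted_performance(0.0, a, b)  # H = 0 gives R = -a + b
```

Because e^H decreases as H falls, the predicted performance rises as entropy is spent and stops at -a + b once the entropy reaches zero, which is why the abstract describes performance as bottlenecked by entropy exhaustion.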


3DLLM-Mem: Long-Term Spatial-Temporal Memory for Embodied 3D Large Language Model

Abstract

arXiv:2505.22657v1 Announce Type: cross Abstract: Humans excel at performing complex tasks by leveraging long-term memory across temporal and spatial experiences. In contrast, current Large Language Models (LLMs) struggle to effectively plan and act in dynamic, multi-room 3D environments. We posit that part of this limitation is due to the lack of proper 3D spatial-temporal memory modeling in LLMs. To address this, we first introduce 3DMem-Bench, a comprehensive benchmark comprising over 26,000 trajectories and 2,892 embodied tasks, question-answering and captioning, designed to evaluate an agent's ability to reason over long-term memory in 3D environments. Second, we propose 3DLLM-Mem, a novel dynamic memory management and fusion model for embodied spatial-temporal reasoning and actions in LLMs. Our model uses working memory tokens, which represent current observations, as queries to selectively attend to and fuse the most useful spatial and temporal features from episodic memory, which stores past observations and interactions. Our approach allows the agent to focus on task-relevant information while maintaining memory efficiency in complex, long-horizon environments. Experimental results demonstrate that 3DLLM-Mem achieves state-of-the-art performance across various tasks, outperforming the strongest baselines by 16.5% in success rate on 3DMem-Bench's most challenging in-the-wild embodied tasks.

摘要

人类擅长通过利用跨时空经验的长时记忆来完成复杂任务。相比之下,当前的大型语言模型(LLMs)在动态、多房间的3D环境中难以有效规划和行动。我们认为这种局限性部分源于LLMs缺乏适当的3D时空记忆建模。为此,我们首先提出了3DMem-Bench——一个包含超过26,000条轨迹和2,892项具身任务的综合基准测试,涵盖问答和描述任务,旨在评估智能体在3D环境中进行长时记忆推理的能力。其次,我们提出3DLLM-Mem,这是一种新颖的动态记忆管理与融合模型,用于LLMs中的具身时空推理与行动。该模型将代表当前观察的工作记忆标记作为查询,选择性地关注并融合来自情景记忆(存储过去观察与交互)中最有用的时空特征。我们的方法使智能体能够在复杂、长周期的环境中专注于任务相关信息,同时保持记忆效率。实验结果表明,3DLLM-Mem在各种任务中实现了最先进的性能,在3DMem-Bench最具挑战性的真实场景具身任务上以16.5%的成功率优势超越了最强基线模型。


Position: Uncertainty Quantification Needs Reassessment for Large-language Model Agents

Abstract

arXiv:2505.22655v1 Announce Type: cross Abstract: Large-language models (LLMs) and chatbot agents are known to provide wrong outputs at times, and it was recently found that this can never be fully prevented. Hence, uncertainty quantification plays a crucial role, aiming to quantify the level of ambiguity in either one overall number or two numbers for aleatoric and epistemic uncertainty. This position paper argues that this traditional dichotomy of uncertainties is too limited for the open and interactive setup that LLM agents operate in when communicating with a user, and that we need to research avenues that enrich uncertainties in this novel scenario. We review the literature and find that popular definitions of aleatoric and epistemic uncertainties directly contradict each other and lose their meaning in interactive LLM agent settings. Hence, we propose three novel research directions that focus on uncertainties in such human-computer interactions: Underspecification uncertainties, for when users do not provide all information or define the exact task at the first go, interactive learning, to ask follow-up questions and reduce the uncertainty about the current context, and output uncertainties, to utilize the rich language and speech space to express uncertainties as more than mere numbers. We expect that these new ways of dealing with and communicating uncertainties will lead to LLM agent interactions that are more transparent, trustworthy, and intuitive.

摘要

众所周知,大语言模型(LLMs)和聊天机器人代理有时会提供错误输出,且近期研究发现这种现象无法完全避免。因此,不确定性量化至关重要,其目标是通过一个总体数值或分别表示偶然不确定性和认知不确定性的两个数值来量化模糊程度。本立场论文认为,传统的不确定性二分法在大语言模型代理与用户交互的开放环境中过于局限,需要探索新途径以丰富这一新型场景中的不确定性研究。通过文献综述,我们发现流行的偶然不确定性和认知不确定性定义在交互式大语言模型代理场景下相互矛盾且失去意义。为此,我们提出三个针对人机交互中不确定性的新研究方向:未明确性不确定性(适用于用户未一次性提供全部信息或明确定义任务的情况)、交互式学习(通过追问后续问题降低当前情境的不确定性)以及输出不确定性(利用丰富的语言和语音空间将不确定性表达为超越单纯数字的形式)。我们预期这些处理与传达不确定性的新方法将使大语言模型代理的交互更具透明度、可信度和直观性。


Maximizing Confidence Alone Improves Reasoning

Abstract

arXiv:2505.22660v1 Announce Type: cross Abstract: Reinforcement learning (RL) has enabled machine learning models to achieve significant advances in many fields. Most recently, RL has empowered frontier language models to solve challenging math, science, and coding problems. However, central to any RL algorithm is the reward function, and reward engineering is a notoriously difficult problem in any domain. In this paper, we propose RENT: Reinforcement Learning via Entropy Minimization -- a fully unsupervised RL method that requires no external reward or ground-truth answers, and instead uses the model's entropy of its underlying distribution as an intrinsic reward. We find that by reinforcing the chains of thought that yield high model confidence on its generated answers, the model improves its reasoning ability. In our experiments, we showcase these improvements on an extensive suite of commonly-used reasoning benchmarks, including GSM8K, MATH500, AMC, AIME, and GPQA, and models of varying sizes from the Qwen and Mistral families. The generality of our unsupervised learning method lends itself to applicability in a wide range of domains where external supervision is limited or unavailable.

摘要

强化学习(RL)已使机器学习模型在众多领域实现重大突破。最近,RL进一步赋能前沿语言模型解决数学、科学和编程等复杂问题。然而,任何RL算法的核心在于奖励函数,而奖励工程在所有领域都是公认的难题。本文提出RENT:基于熵最小化的强化学习方法——这是一种完全无监督的RL方法,无需外部奖励或真实答案,而是利用模型底层分布的熵作为内在奖励。我们发现,通过强化那些使模型对生成答案具有高置信度的思维链,模型的推理能力得以提升。实验中,我们在包括GSM8K、MATH500、AMC、AIME和GPQA等广泛使用的推理基准测试集上,以及Qwen和Mistral系列不同规模的模型上验证了这种改进。我们的无监督学习方法具有普适性,可广泛应用于外部监督有限或缺失的众多领域。
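The intrinsic reward described above, using the model's own entropy instead of an external signal, can be sketched in a few lines. The assumptions here are that the reward is the negative mean entropy of the per-token output distributions and that raw logits are available as plain lists; which tokens RENT actually averages over is not stated in the abstract, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def confidence_reward(per_token_logits):
    """Negative mean token entropy: higher means a more confident model."""
    entropies = [token_entropy(l) for l in per_token_logits]
    return -sum(entropies) / len(entropies)


# Made-up logits over a 3-token vocabulary for two generated positions.
confident = [[8.0, 0.0, 0.0], [7.0, 0.5, 0.0]]
uncertain = [[1.0, 1.0, 1.0], [0.5, 0.4, 0.6]]

r_confident = confidence_reward(confident)
r_uncertain = confidence_reward(uncertain)
```

A peaked distribution yields a reward near zero, while a flat one is penalized by up to log of the vocabulary size, so maximizing this reward reinforces generations the model is confident about.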


Fine-Grained and Thematic Evaluation of LLMs in Social Deduction Game

Abstract

arXiv:2408.09946v2 Announce Type: replace Abstract: Recent studies have investigated whether large language models (LLMs) can support obscure communication that requires specialized skills, such as inferring subtext or doublespeak. To conduct the investigation, researchers have used social deduction games (SDGs) as their experimental environment, in which players conceal and infer specific information. However, prior work has often overlooked how LLMs should be evaluated in such settings. Specifically, we point out two issues with the evaluation methods they employed. First, metrics used in prior studies are coarse-grained as they are based on overall game outcomes that often fail to capture event-level behaviors; Second, error analyses have lacked structured methodologies capable of producing insights that meaningfully support evaluation outcomes. To address these issues, we propose a macroscopic and systematic approach to the investigation. Specifically, we introduce seven fine-grained metrics that resolve the first issue. To tackle the second issue, we conducted a thematic analysis and identified four major reasoning failures that undermine LLMs' performance in obscured communication.

摘要

近期研究探讨了大语言模型(LLMs)是否能够支持需要特殊技能的隐蔽交流,例如推断潜台词或双关语。为此,研究者采用社交推理游戏(SDGs)作为实验环境,在该环境中玩家需要隐藏和推断特定信息。然而,先前工作往往忽视了在此类场景下应如何评估LLMs。具体而言,我们指出其所采用评估方法存在的两个问题:首先,既有研究使用的度量指标较为粗粒度,这些基于整体游戏结果的指标往往无法捕捉事件级行为;其次,错误分析缺乏结构化方法,难以产生能有效支撑评估结论的深入洞见。针对这些问题,我们提出了一种宏观且系统化的研究路径:通过引入七个细粒度指标解决首个问题;为应对第二个问题,我们开展主题分析并识别出四大类损害LLMs隐蔽交流性能的推理缺陷。


Automating Thought of Search: A Journey Towards Soundness and Completeness

Abstract

arXiv:2408.11326v2 Announce Type: replace Abstract: Large language models (LLMs) are being used to solve planning problems that require search. Most of the literature uses LLMs as world models to define the search space, forgoing soundness for the sake of flexibility. A recent work, Thought of Search (ToS), proposed defining the search space with code, having LLMs produce that code. ToS requires a human in the loop, collaboratively producing a sound successor function and goal test. The result, however, is worth the effort: all the tested datasets were solved with 100% accuracy. Consequently, there is great potential to automate the ToS process. We take a first major step towards automating ToS (AutoToS), taking the human out of the loop of interactions with the language model. AutoToS guides the language model step by step towards the generation of sound and complete search components, through feedback from both generic and domain specific unit tests. We show that AutoToS is able to achieve 100% accuracy on all the evaluated domains with a small number of LLM calls.

摘要

大型语言模型(LLMs)正被用于解决需要搜索的规划问题。现有研究大多将LLMs作为世界模型来定义搜索空间,为追求灵活性而牺牲了可靠性(soundness)。近期研究'搜索思维'(ToS)提出通过代码定义搜索空间,由LLMs生成该代码。ToS需要人工参与循环协作,共同生成可靠的后继函数和目标测试。尽管过程复杂,但其成果显著:所有测试数据集均实现了100%的准确率。因此,实现ToS流程自动化具有巨大潜力。本研究在自动化ToS(AutoToS)方向上迈出重要一步,消除了与语言模型交互中的人工环节。AutoToS通过通用和领域特定单元测试的反馈,逐步引导语言模型生成可靠且完备的搜索组件。实验表明,AutoToS能在少量LLM调用下,在所有评估领域实现100%的准确率。
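
To make "defining the search space with code" concrete, here is a minimal sketch of a successor function and goal test plugged into breadth-first search, together with one generic unit-test-style check. The water-jug domain, the names `successors`, `is_goal`, and `check_successors`, and the specific check are illustrative assumptions; the abstract does not list the domains or tests AutoToS actually uses.

```python
from collections import deque

CAPACITIES = (4, 3)  # hypothetical toy domain: two jugs holding 4 and 3 litres


def successors(state):
    """Return every state reachable in one fill, empty, or pour action."""
    a, b = state
    cap_a, cap_b = CAPACITIES
    result = {(cap_a, b), (a, cap_b), (0, b), (a, 0)}  # fill or empty either jug
    pour = min(a, cap_b - b)  # pour jug A into jug B
    result.add((a - pour, b + pour))
    pour = min(b, cap_a - a)  # pour jug B into jug A
    result.add((a + pour, b - pour))
    result.discard(state)  # drop no-op moves
    return sorted(result)


def is_goal(state):
    """Goal test: exactly 2 litres in the first jug."""
    return state[0] == 2


def bfs(start):
    """Breadth-first search; returns a shortest path of states, or None."""
    frontier = deque([start])
    parent = {start: None}
    while frontier:
        state = frontier.popleft()
        if is_goal(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None


def check_successors(state):
    """Generic check: every successor respects the jug capacities."""
    return all(
        0 <= a <= CAPACITIES[0] and 0 <= b <= CAPACITIES[1]
        for a, b in successors(state)
    )


plan = bfs((0, 0))
```

Because the search itself is a standard exhaustive algorithm, the correctness of the result rests entirely on the two domain functions, which is why feedback from unit tests on those functions is a natural place to steer the language model.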


LAMBDA: A Large Model Based Data Agent

Abstract

arXiv:2407.17535v3 Announce Type: replace Abstract: We introduce LArge Model Based Data Agent (LAMBDA), a novel open-source, code-free multi-agent data analysis system that leverages the power of large language models. LAMBDA is designed to address data analysis challenges in data-driven applications through innovatively designed data agents using natural language. At the core of LAMBDA are two key agent roles: the programmer and the inspector, which are engineered to work together seamlessly. Specifically, the programmer generates code based on the user's instructions and domain-specific knowledge, while the inspector debugs the code when necessary. To ensure robustness and handle adverse scenarios, LAMBDA features a user interface that allows direct user intervention. Moreover, LAMBDA can flexibly integrate external models and algorithms through our proposed Knowledge Integration Mechanism, catering to the needs of customized data analysis. LAMBDA has demonstrated strong performance on various data analysis tasks. It has the potential to enhance data analysis paradigms by seamlessly integrating human and artificial intelligence, making it more accessible, effective, and efficient for users from diverse backgrounds. The strong performance of LAMBDA in solving data analysis problems is demonstrated using real-world data examples. The code for LAMBDA is available at https://github.com/AMA-CMFAI/LAMBDA and videos of three case studies can be viewed at https://www.polyu.edu.hk/ama/cmfai/lambda.html.

摘要

我们推出LAMBDA(基于大型模型的数据代理),这是一种新颖的开源、免代码多代理数据分析系统,其核心在于充分利用大型语言模型的强大能力。LAMBDA通过创新设计的自然语言数据代理,旨在解决数据驱动应用中的分析挑战。该系统包含两个关键代理角色:程序员和检查员,它们被设计为协同工作。具体而言,程序员根据用户指令和领域知识生成代码,而检查员则在必要时进行代码调试。为确保系统鲁棒性并应对异常情况,LAMBDA配备了允许用户直接干预的操作界面。此外,通过我们提出的知识整合机制,LAMBDA能够灵活集成外部模型与算法,满足定制化数据分析需求。实验表明,LAMBDA在各类数据分析任务中表现优异。该系统有望通过无缝整合人类智能与人工智能来革新数据分析范式,使不同背景的用户都能更便捷、高效地完成分析工作。我们通过实际数据案例验证了LAMBDA在解决数据分析问题方面的卓越性能。LAMBDA的代码已发布于https://github.com/AMA-CMFAI/LAMBDA,三个案例研究的演示视频可在https://www.polyu.edu.hk/ama/cmfai/lambda.html查看。


MINDSTORES: Memory-Informed Neural Decision Synthesis for Task-Oriented Reinforcement in Embodied Systems

Abstract

arXiv:2501.19318v2 Announce Type: replace Abstract: While large language models (LLMs) have shown promising capabilities as zero-shot planners for embodied agents, their inability to learn from experience and build persistent mental models limits their robustness in complex open-world environments like Minecraft. We introduce MINDSTORES, an experience-augmented planning framework that enables embodied agents to build and leverage mental models through natural interaction with their environment. Drawing inspiration from how humans construct and refine cognitive mental models, our approach extends existing zero-shot LLM planning by maintaining a database of past experiences that informs future planning iterations. The key innovation is representing accumulated experiences as natural language embeddings of (state, task, plan, outcome) tuples, which can then be efficiently retrieved and reasoned over by an LLM planner to generate insights and guide plan refinement for novel states and tasks. Through extensive experiments in the MineDojo environment, a simulation environment for agents in Minecraft that provides low-level controls for Minecraft, we find that MINDSTORES learns and applies its knowledge significantly better than existing memory-based LLM planners while maintaining the flexibility and generalization benefits of zero-shot approaches, representing an important step toward more capable embodied AI systems that can learn continuously through natural experience.

摘要

虽然大型语言模型(LLMs)作为具身智能体的零样本规划器已展现出潜力,但其无法从经验中学习并建立持久心智模型的特性限制了它们在《我的世界》等复杂开放世界环境中的鲁棒性。我们提出MINDSTORES——一种经验增强型规划框架,使具身智能体能够通过与环境的自然交互构建并利用心智模型。受人类构建与完善认知心智模型的启发,该方法通过维护记录过往经验的数据库来增强现有零样本LLM规划能力,这些经验可为后续规划迭代提供参考。其核心创新在于将累积经验表示为(状态、任务、规划、结果)元组的自然语言嵌入向量,LLM规划器可高效检索这些向量并进行推理,从而针对新状态和任务生成洞见并指导规划优化。在MineDojo(一个为《我的世界》智能体提供底层控制接口的仿真环境)中的大量实验表明,MINDSTORES在知识学习与应用方面显著优于现有基于记忆的LLM规划器,同时保持了零样本方法的灵活性与泛化优势,这为构建能通过自然经验持续学习的更强具身AI系统迈出了重要一步。
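
The experience database described above can be sketched as a store of (state, task, plan, outcome) records that are embedded and retrieved by similarity. This is a toy under loud assumptions: a bag-of-words vector stands in for the natural language embeddings, and the class `ExperienceStore` and its methods are hypothetical names, not the MINDSTORES API.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Experience:
    state: str
    task: str
    plan: str
    outcome: str

    def text(self):
        """Flatten the tuple into one natural language string."""
        return f"{self.state} {self.task} {self.plan} {self.outcome}"


def embed(text):
    """Stand-in embedding: word counts (a real system would use a learned encoder)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class ExperienceStore:
    def __init__(self):
        self._items = []  # list of (embedding, Experience)

    def add(self, experience):
        self._items.append((embed(experience.text()), experience))

    def retrieve(self, state, task, k=1):
        """Return the k stored experiences most similar to the new state and task."""
        query = embed(f"{state} {task}")
        ranked = sorted(self._items, key=lambda item: cosine(query, item[0]), reverse=True)
        return [exp for _, exp in ranked[:k]]


store = ExperienceStore()
store.add(Experience("forest no tools", "craft wooden pickaxe",
                     "collect logs then craft planks sticks pickaxe", "success"))
store.add(Experience("plains near water", "catch fish",
                     "craft fishing rod then fish", "failed no string"))
nearest = store.retrieve("forest no tools", "craft wooden pickaxe", k=1)
```

The retrieved records would then be placed in the planner's prompt so that past outcomes can inform the next plan.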


Position: Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints

Abstract

arXiv:2503.01747v3 Announce Type: replace Abstract: Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e. producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios. We provide a simple Python library for these Bayesian methods at https://github.com/sambowyer/bayes_evals .

摘要

对大语言模型(LLMs)进行严格的统计评估,包括有效的误差范围和显著性检验,对于实现有意义且可靠的性能评估至关重要。目前,当报告此类统计指标时,通常依赖于中心极限定理(CLT)。在本立场文件中,我们认为,虽然基于CLT的不确定性量化方法适用于包含数千个示例的基准测试,但它们无法为依赖较小、高度专业化基准测试的LLM评估提供充分的不确定性估计。在这些小数据场景中,我们证明基于CLT的方法表现非常差,通常会严重低估不确定性(即产生的误差范围过小)。我们针对这些日益常见的情况,提出了替代的频率主义和贝叶斯方法建议,这些方法既易于实现又更为合适。我们在https://github.com/sambowyer/bayes_evals 上提供了一个简单的Python库来实现这些贝叶斯方法。
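
A small worked example shows why CLT-based error bars can be too narrow on small benchmarks. The sketch below compares the usual CLT (Wald) interval with the Wilson score interval, which is one standard frequentist alternative; it is chosen here for illustration and is not claimed to be the specific method the paper or the `bayes_evals` library recommends. The benchmark size and score are made up.

```python
import math

Z95 = 1.959963984540054  # two-sided 95% normal quantile


def clt_interval(successes, n, z=Z95):
    """CLT (Wald) interval for a benchmark accuracy."""
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)


def wilson_interval(successes, n, z=Z95):
    """Wilson score interval, clamped to [0, 1]."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical small benchmark: the model answers all 20 questions correctly.
clt_lo, clt_hi = clt_interval(20, 20)
wilson_lo, wilson_hi = wilson_interval(20, 20)
```

With a perfect score the estimated variance is zero, so the CLT interval collapses to a single point and reports no uncertainty at all, while the Wilson interval still admits that the true accuracy could be noticeably below 100%.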


Kimi k1.5: Scaling Reinforcement Learning with LLMs

Abstract

arXiv:2501.12599v3 Announce Type: replace Abstract: Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94-th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).

摘要

基于下一词元预测的语言模型预训练方法虽能有效扩展计算规模,但其性能受限于可用训练数据量。强化学习(RL)的规模化应用为人工智能持续进步开辟了新路径,其核心在于大型语言模型(LLMs)可通过奖励驱动的探索机制自主扩展训练数据。然而,此前公开研究尚未取得突破性成果。为此,我们报告了多模态LLM模型Kimi k1.5的强化学习训练实践,包括RL训练技术、多模态数据配方及基础设施优化方案。长上下文扩展与改进的策略优化方法是本研究的核心要素,由此建立了一个简洁高效的RL框架,无需依赖蒙特卡洛树搜索、价值函数或过程奖励模型等复杂技术。值得注意的是,我们的系统在多项基准测试和多模态任务中达到最先进推理性能——例如AIME 77.5分、MATH500 96.2分、Codeforces 94百分位、MathVista 74.9分——与OpenAI o1持平。此外,我们提出有效的长链推理优化技术(long2short),利用长思维链(long-CoT)方法提升短思维链(short-CoT)模型性能,取得当前最佳的短链推理结果——如AIME 60.8分、MATH500 94.6分、LiveCodeBench 47.3分——显著超越GPT-4o和Claude Sonnet 3.5等现有短链模型(最高提升达550%)。


Patterns Over Principles: The Fragility of Inductive Reasoning in LLMs under Noisy Observations

Abstract

arXiv:2502.16169v2 Announce Type: replace Abstract: Inductive reasoning, a cornerstone of human cognition, enables generalization from limited data but hasn't yet been fully achieved by large language models (LLMs). While modern LLMs excel at reasoning tasks, their ability to maintain stable and consistent rule abstraction under imperfect observations remains underexplored. To fill this gap, in this work, we introduce Robust Rule Induction, a task that evaluates LLMs' capability in inferring rules from data that are fused with noisy examples. To address this task, we further propose Sample-steered Rule Refinement (SRR), a method enhancing reasoning stability via observation diversification and execution-guided feedback. Experiments across arithmetic, cryptography, and list functions reveal: (1) SRR outperforms other methods with minimal performance degradation under noise; (2) Despite slight accuracy variation, LLMs exhibit instability under noise (e.g., 0% accuracy change with only 70% consistent score); (3) Counterfactual task gaps highlight LLMs' reliance on memorized patterns over genuine abstraction. Our findings challenge LLMs' reasoning robustness, revealing susceptibility to hypothesis drift and pattern overfitting, while providing empirical evidence critical for developing human-like inductive systems. Code and data are available at https://github.com/HKUST-KnowComp/Robust-Rule-Induction.

摘要

归纳推理作为人类认知的基石,能够从有限数据中进行泛化,但大型语言模型(LLMs)尚未完全实现这一能力。尽管现代LLMs在推理任务上表现优异,其在非完美观察下保持稳定且一致的规则抽象能力仍未得到充分探索。为填补这一空白,本研究提出"鲁棒规则归纳"任务,用于评估LLMs从含有噪声示例的数据中推断规则的能力。针对该任务,我们进一步提出样本引导规则细化(SRR)方法,通过观察多样化和执行引导反馈来增强推理稳定性。在算术、密码学和列表函数领域的实验表明:(1)SRR在噪声环境下性能下降最小,优于其他方法;(2)尽管准确率变化较小,LLMs在噪声下表现出不稳定性(例如准确率变化0%时,一致性评分仅70%);(3)反事实任务差距揭示了LLMs更依赖记忆模式而非真正的抽象能力。我们的研究发现挑战了LLMs的推理鲁棒性,揭示了其存在假设漂移和模式过拟合的脆弱性,同时为开发类人归纳系统提供了关键实证依据。代码与数据详见https://github.com/HKUST-KnowComp/Robust-Rule-Induction。


Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration

Abstract

arXiv:2502.11882v5 Announce Type: replace Abstract: Agents built on large language models (LLMs) have excelled in turn-by-turn human-AI collaboration but struggle with simultaneous tasks requiring real-time interaction. Latency issues and the challenge of inferring variable human strategies hinder their ability to make autonomous decisions without explicit instructions. Through experiments with current independent System 1 and System 2 methods, we validate the necessity of using Dual Process Theory (DPT) in real-time tasks. We propose DPT-Agent, a novel language agent framework that integrates System 1 and System 2 for efficient real-time simultaneous human-AI collaboration. DPT-Agent's System 1 uses a Finite-state Machine (FSM) and code-as-policy for fast, intuitive, and controllable decision-making. DPT-Agent's System 2 integrates Theory of Mind (ToM) and asynchronous reflection to infer human intentions and perform reasoning-based autonomous decisions. We demonstrate the effectiveness of DPT-Agent through further experiments with rule-based agents and human collaborators, showing significant improvements over mainstream LLM-based frameworks. DPT-Agent can effectively help LLMs convert correct slow thinking and reasoning into executable actions, thereby improving performance. To the best of our knowledge, DPT-Agent is the first language agent framework that achieves successful real-time simultaneous human-AI collaboration autonomously. Code of DPT-Agent can be found in https://github.com/sjtu-marl/DPT-Agent.

摘要

基于大语言模型(LLM)构建的智能体在回合制人机协作中表现出色,但在需要实时交互的并行任务中存在困难。延迟问题以及推断人类多变策略的挑战,阻碍了其在无明确指令时进行自主决策的能力。通过对现有独立系统1和系统2方法的实验验证,我们证实了双过程理论(DPT)在实时任务中的必要性。本文提出DPT-Agent——一种新型语言智能体框架,通过整合系统1与系统2实现高效实时并行人机协作。DPT-Agent的系统1采用有限状态机(FSM)和代码即策略机制,实现快速、直观且可控的决策;系统2整合心理理论(ToM)与异步反思机制,用以推断人类意图并执行基于推理的自主决策。我们通过与规则型智能体及人类协作者的进一步实验,证明DPT-Agent相较主流LLM框架具有显著优势,能有效帮助大语言模型将正确的慢思考推理转化为可执行动作,从而提升任务表现。据我们所知,DPT-Agent是首个实现自主实时并行人机协作成功的语言智能体框架。项目代码详见https://github.com/sjtu-marl/DPT-Agent。


Agent-Centric Personalized Multiple Clustering with Multi-Modal LLMs

Abstract

arXiv:2503.22241v3 Announce Type: replace Abstract: Personalized multiple clustering aims to generate diverse partitions of a dataset based on different user-specific aspects, rather than a single clustering. It has recently drawn research interest for accommodating varying user preferences. Recent approaches primarily use CLIP embeddings with proxy learning to extract representations biased toward user clustering preferences. However, CLIP primarily focuses on coarse image-text alignment, lacking a deep contextual understanding of user interests. To overcome these limitations, we propose an agent-centric personalized clustering framework that leverages multi-modal large language models (MLLMs) as agents to comprehensively traverse a relational graph to search for clusters based on user interests. Due to the advanced reasoning mechanism of MLLMs, the obtained clusters align more closely with user-defined criteria than those obtained from CLIP-based representations. To reduce computational overhead, we shorten the agents' traversal path by constructing a relational graph using user-interest-biased embeddings extracted by MLLMs. A large number of weakly connected edges can be filtered out based on embedding similarity, facilitating an efficient traversal search for agents. Experimental results show that the proposed method achieves NMI scores of 0.9667 and 0.9481 on the Card Order and Card Suits benchmarks, respectively, largely improving the SOTA model by over 140%.

摘要

个性化多聚类旨在根据用户特定的不同方面生成数据集的多样化划分,而非单一聚类。近年来,该方法因能适应多样化的用户偏好而受到研究关注。现有方法主要利用CLIP嵌入与代理学习来提取偏向用户聚类偏好的表征。然而,CLIP主要关注粗粒度的图像-文本对齐,缺乏对用户兴趣的深层上下文理解。为克服这些局限,我们提出一种以智能体为中心的个性化聚类框架,通过多模态大语言模型(MLLMs)作为智能体全面遍历关系图,从而基于用户兴趣搜索聚类簇。得益于MLLMs的高级推理机制,所获聚类簇比基于CLIP表征的结果更贴合用户定义标准。为降低计算开销,我们利用MLLMs提取的用户兴趣偏置嵌入构建关系图以缩短智能体遍历路径。基于嵌入相似度可过滤大量弱连接边,从而提升智能体的遍历搜索效率。实验结果表明,该方法在Card Order和Card Suits基准上的NMI分数分别达到0.9667和0.9481,较现有最优模型提升超过140%。


End-to-End Breast Cancer Radiotherapy Planning via LMMs with Consistency Embedding

Abstract

arXiv:2311.15876v4 Announce Type: replace-cross Abstract: Recent advances in AI foundation models have significant potential for lightening the clinical workload by mimicking the comprehensive and multi-faceted approaches used by medical professionals. In the field of radiation oncology, the integration of multiple modalities holds great importance, so there are abundant opportunities for foundation models. Inspired by this, here we present RO-LMM, a multi-purpose, comprehensive large multimodal model (LMM) tailored for the field of radiation oncology. This model effectively manages a series of tasks within the clinical workflow, including clinical context summarization, radiation treatment plan suggestion, and plan-guided target volume segmentation by leveraging the capabilities of LMM. In particular, to perform consecutive clinical tasks without error accumulation, we present a novel Consistency Embedding Fine-Tuning (CEFTune) technique, which boosts LMM's robustness to noisy inputs while preserving the consistency of handling clean inputs. We further extend this concept to an LMM-driven segmentation framework, leading to a novel Consistency Embedding Segmentation (CESEG) technique. Experimental results including multi-centre validation confirm that our RO-LMM with CEFTune and CESEG delivers promising performance on multiple clinical tasks with generalization capabilities.

摘要

人工智能基础模型的最新进展通过模拟医疗专业人员全面、多方位的工作方法,在减轻临床工作负担方面展现出巨大潜力。在放射肿瘤学领域,多模态融合具有重要价值,这为基础模型的应用提供了广阔空间。受此启发,我们提出RO-LMM——一个专为放射肿瘤学领域设计的通用型大型多模态模型(LMM)。该模型通过发挥LMM的优势,有效管理临床工作流中的系列任务,包括临床背景总结、放疗方案建议以及计划引导的靶区勾画。特别地,为实现无误差累积的连续临床任务处理,我们提出新型一致性嵌入微调技术(CEFTune),该技术在保持处理干净输入一致性的同时,增强了LMM对噪声输入的鲁棒性。我们进一步将该理念延伸至LMM驱动的分割框架,开发出创新性的一致性嵌入分割技术(CESEG)。包含多中心验证的实验结果表明,配备CEFTune和CESEG的RO-LMM在多项临床任务中均表现出卓越性能,并具备良好的泛化能力。


PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration

Abstract

arXiv:2406.01394v5 Announce Type: replace-cross Abstract: The widespread usage of online Large Language Model (LLM) inference services has raised significant privacy concerns about the potential exposure of private information in user inputs to malicious eavesdroppers. Existing privacy protection methods for LLMs suffer from either insufficient privacy protection, performance degradation, or large inference time overhead. To address these limitations, we propose PrivacyRestore, a plug-and-play method to protect the privacy of user inputs during LLM inference. The server first trains restoration vectors for each privacy span and then releases them to clients. A privacy span is defined as a contiguous sequence of tokens within a text that contains private information. The client then aggregates the restoration vectors of all privacy spans in the input into a single meta restoration vector, which is later sent to the server side along with the input without privacy spans. The private information is restored via activation steering during inference. Furthermore, we prove that PrivacyRestore inherently prevents the linear growth of the privacy budget. We create three datasets, covering medical and legal domains, to evaluate the effectiveness of privacy-preserving methods. The experimental results show that PrivacyRestore effectively protects private information and maintains acceptable levels of performance and inference overhead.

摘要

在线大型语言模型(LLM)推理服务的广泛使用引发了严重的隐私担忧,即用户输入中的私人信息可能被恶意窃听者获取。现有的LLM隐私保护方法存在隐私保护不足、性能下降或推理时间开销过大等问题。为解决这些局限性,我们提出PrivacyRestore,一种即插即用的方法,用于在LLM推理过程中保护用户输入的隐私。服务器首先为每个隐私片段训练恢复向量,然后将其发布给客户端。隐私片段定义为文本中包含私人信息的连续令牌序列。客户端随后将输入中所有隐私片段的恢复向量聚合为一个元恢复向量,该向量随后与去除隐私片段的输入一起发送至服务器端。私人信息在推理过程中通过激活导向恢复。此外,我们证明PrivacyRestore能够从根本上防止隐私预算的线性增长。我们创建了涵盖医疗和法律领域的三个数据集,以评估隐私保护方法的有效性。实验结果表明,PrivacyRestore能有效保护私人信息,同时保持可接受的性能水平和推理开销。


gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling

Abstract

arXiv:2504.14775v2 Announce Type: replace Abstract: Pipeline parallelism has emerged as a predominant approach for deploying large language models (LLMs) across distributed nodes, owing to its lower communication overhead compared to tensor parallelism. While demonstrating high throughput in request serving, pipeline parallelism often suffers from performance limitations caused by pipeline bubbles, which primarily result from imbalanced computation delays across batches. Existing methods like Sarathi-Serve attempt to address this through hybrid scheduling of chunked prefill and decode tokens using a fixed token budget. However, such methods may experience significant fluctuations due to either insufficient prefill tokens or uneven distribution of decode tokens, ultimately leading to computational imbalance. To overcome these inefficiencies, we present gLLM, a globally balanced pipeline parallelism system incorporating Token Throttling to effectively mitigate the pipeline bubbles. Our Token Throttling mechanism is a fine-grained scheduling policy that independently regulates the quantities of prefill and decode tokens, thus enabling balanced computation by leveraging global information from the inference system. Specifically, for decode tokens, gLLM maintains a near-consistent token count across processing batches. For prefill tokens, it dynamically adjusts batch sizes based on both the total pending tokens and the memory utilization rate of the key-value cache (KV cache). Furthermore, the gLLM runtime adopts an asynchronous execution and message passing architecture specifically optimized for pipeline parallelism characteristics. Experimental evaluations with representative LLMs show that gLLM achieves significant performance improvements, delivering 11% to 398% higher maximum throughput compared to state-of-the-art pipeline or tensor parallelism systems, while simultaneously maintaining lower latency.

摘要

流水线并行因其相较于张量并行具有更低的通信开销,已成为跨分布式节点部署大语言模型(LLM)的主流方法。尽管在请求服务中展现出高吞吐量,流水线并行常因流水线气泡导致的性能限制而受限,这些气泡主要源于批次间计算延迟的不均衡。现有方法如Sarathi-Serve试图通过采用固定令牌预算的块式预填充和解码令牌混合调度来解决此问题。然而,此类方法可能因预填充令牌不足或解码令牌分布不均而产生显著波动,最终导致计算失衡。为克服这些低效问题,我们提出gLLM——一种集成令牌节流机制的全局均衡流水线并行系统,可有效缓解流水线气泡。我们的令牌节流机制是一种细粒度调度策略,独立调控预填充与解码令牌数量,从而通过利用推理系统的全局信息实现计算均衡。具体而言,对于解码令牌,gLLM在处理批次间保持近乎一致的令牌数量;对于预填充令牌,则根据待处理令牌总量及键值缓存(KV缓存)的内存利用率动态调整批次大小。此外,gLLM运行时采用专为流水线并行特性优化的异步执行与消息传递架构。针对代表性LLM的实验评估表明,gLLM实现了显著的性能提升,与最先进的流水线或张量并行系统相比,最大吞吐量提高11%至398%,同时保持更低延迟。
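
The two throttling rules described above can be sketched as two small functions. Everything numeric here is an assumption made for illustration: the abstract states only that decode tokens are kept near-constant across batches and that prefill batch sizes depend on pending tokens and KV cache utilization. The formulas, the default limits, and the names `decode_tokens_per_batch` and `prefill_token_budget` are hypothetical and are not taken from gLLM.

```python
import math


def decode_tokens_per_batch(running_decode_requests, pipeline_stages):
    """Spread in-flight decode requests evenly over the batches in the pipeline."""
    return math.ceil(running_decode_requests / pipeline_stages)


def prefill_token_budget(pending_prefill_tokens, kv_cache_utilization,
                         max_budget=2048, min_budget=64, drain_iterations=8):
    """Assumed rule: grow the budget with the backlog, shrink it as the KV cache fills."""
    if pending_prefill_tokens <= 0:
        return 0
    # Aim to drain the backlog over a fixed number of iterations (assumed heuristic).
    by_backlog = max(min_budget, math.ceil(pending_prefill_tokens / drain_iterations))
    # Scale the ceiling by the fraction of KV cache that is still free.
    free_fraction = max(0.0, 1.0 - kv_cache_utilization)
    by_memory = int(max_budget * free_fraction)
    return min(by_backlog, by_memory, max_budget, pending_prefill_tokens)


budget_roomy = prefill_token_budget(8000, kv_cache_utilization=0.5)
budget_tight = prefill_token_budget(8000, kv_cache_utilization=0.9)
decode_share = decode_tokens_per_batch(10, pipeline_stages=4)
```

The point of the sketch is the separation of concerns: the decode rule depends only on how many requests are in flight, while the prefill rule reacts to global backlog and memory pressure, in contrast to a single fixed token budget shared by both.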


Edit Distance Robust Watermarks via Indexing Pseudorandom Codes

Abstract

arXiv:2406.02633v2 Announce Type: replace-cross Abstract: Motivated by the problem of detecting AI-generated text, we consider the problem of watermarking the output of language models with provable guarantees. We aim for watermarks which satisfy: (a) undetectability, a cryptographic notion introduced by Christ, Gunn & Zamir (2024) which stipulates that it is computationally hard to distinguish watermarked language model outputs from the model's actual output distribution; and (b) robustness to channels which introduce a constant fraction of adversarial insertions, substitutions, and deletions to the watermarked text. Earlier schemes could only handle stochastic substitutions and deletions, and thus we are aiming for a more natural and appealing robustness guarantee that holds with respect to edit distance. Our main result is a watermarking scheme which achieves both undetectability and robustness to edits when the alphabet size for the language model is allowed to grow as a polynomial in the security parameter. To derive such a scheme, we follow an approach introduced by Christ & Gunn (2024), which proceeds via first constructing pseudorandom codes satisfying undetectability and robustness properties analogous to those above; our key idea is to handle adversarial insertions and deletions by interpreting the symbols as indices into the codeword, which we call indexing pseudorandom codes. Additionally, our codes rely on weaker computational assumptions than used in previous work. Then we show that there is a generic transformation from such codes over large alphabets to watermarking schemes for arbitrary language models.

摘要

受AI生成文本检测问题的启发,我们研究了具有可证明保证的语言模型输出水印技术。我们致力于实现满足以下特性的水印方案:(a)不可检测性——这是Christ、Gunn与Zamir(2024)提出的密码学概念,要求计算上难以区分带水印的语言模型输出与模型真实输出分布;(b)对对抗性操作的鲁棒性——即当水印文本遭受恒定比例的对抗性插入、替换和删除时仍能保持有效性。现有方案仅能处理随机替换和删除,因此我们追求更具自然吸引力且基于编辑距离的鲁棒性保证。

我们的核心成果是:当语言模型的字母表规模随安全参数呈多项式增长时,可同时实现不可检测性与编辑鲁棒性的水印方案。该方案的构建遵循Christ与Gunn(2024)提出的方法框架,即首先构造满足类似不可检测性与鲁棒性的伪随机码;我们的关键创新是通过将符号解释为码字索引(称为索引伪随机码)来处理对抗性插入和删除操作。此外,该编码方案所依赖的计算假设弱于前人工作。最后我们证明:存在从大字母表编码到任意语言模型水印方案的通用转换方法。
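
The robustness notion above is stated in terms of edit distance, which counts insertions, substitutions, and deletions. As a reference point, here is the standard dynamic-programming (Levenshtein) computation plus a helper that checks whether an edited text stays within a constant fraction of edits. The helper `within_edit_budget` and its `fraction` parameter are hypothetical illustrations; the watermarking scheme itself is not sketched here.

```python
def edit_distance(a, b):
    """Levenshtein distance between sequences a and b (unit-cost edits)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def within_edit_budget(original, edited, fraction):
    """True if the edit distance is at most `fraction` of the original length."""
    return edit_distance(original, edited) <= fraction * len(original)


# Works on token lists as well as strings.
original_tokens = "the quick brown fox jumps over the lazy dog".split()
edited_tokens = "the quick red fox leaps over the dog".split()
distance = edit_distance(original_tokens, edited_tokens)
```

A scheme that only tolerates substitutions would be broken by a single inserted token that shifts every later position, which is why a guarantee in edit distance is the stronger and more natural one.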


Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective

Abstract

arXiv:2406.14023v3 Announce Type: replace-cross Abstract: As large language models (LLMs) become an important way of information access, there have been increasing concerns that LLMs may intensify the spread of unethical content, including implicit bias that hurts certain populations without explicit harmful words. In this paper, we conduct a rigorous evaluation of LLMs' implicit bias towards certain demographics by attacking them from a psychometric perspective to elicit agreements to biased viewpoints. Inspired by psychometric principles in cognitive and social psychology, we propose three attack approaches, i.e., Disguise, Deception, and Teaching. Incorporating the corresponding attack instructions, we built two benchmarks: (1) a bilingual dataset with biased statements covering four bias types (2.7K instances) for extensive comparative analysis, and (2) BUMBLE, a larger benchmark spanning nine common bias types (12.7K instances) for comprehensive evaluation. Extensive evaluation of popular commercial and open-source LLMs shows that our methods can elicit LLMs' inner bias more effectively than competitive baselines. Our attack methodology and benchmarks offer an effective means of assessing the ethical risks of LLMs, driving progress toward greater accountability in their development. Our code, data and benchmarks are available at https://github.com/yuchenwen1/ImplicitBiasPsychometricEvaluation and https://github.com/yuchenwen1/BUMBLE.

摘要

随着大语言模型(LLMs)成为信息获取的重要途径,人们日益担忧LLMs可能加剧不道德内容的传播,包括那些通过隐性偏见伤害特定群体而无需使用明显有害措辞的现象。本文通过心理测量学视角对LLMs针对特定人群的隐性偏见展开严格评估,通过诱导模型认同偏见观点来揭示其内在倾向。受认知与社会心理学中心理测量学原理启发,我们提出三种攻击方法:伪装(Disguise)、欺骗(Deception)和教导(Teaching)。基于相应攻击指令,我们构建了两个基准测试集:(1)涵盖四种偏见类型(2.7K个实例)的双语偏见陈述数据集,用于广泛对比分析;(2)BUMBLE基准集,包含九种常见偏见类型(12.7K个实例)以实现全面评估。对主流商业及开源LLMs的广泛实验表明,我们的方法比竞争基线更能有效诱发模型内在偏见。本研究的攻击方法论与基准测试集为评估LLMs伦理风险提供了有效工具,可推动其开发过程的责任化进程。代码、数据及基准集详见 https://github.com/yuchenwen1/ImplicitBiasPsychometricEvaluation 和 https://github.com/yuchenwen1/BUMBLE 。


Overcoming the Machine Penalty with Imperfectly Fair AI Agents

Abstract

arXiv:2410.03724v3 Announce Type: replace-cross Abstract: Despite rapid technological progress, effective human-machine cooperation remains a significant challenge. Humans tend to cooperate less with machines than with fellow humans, a phenomenon known as the machine penalty. Here, we show that artificial intelligence (AI) agents powered by large language models can overcome this penalty in social dilemma games with communication. In a pre-registered experiment with 1,152 participants, we deploy AI agents exhibiting three distinct personas: selfish, cooperative, and fair. However, only fair agents elicit human cooperation at rates comparable to human-human interactions. Analysis reveals that fair agents, similar to human participants, occasionally break pre-game cooperation promises, but nonetheless effectively establish cooperation as a social norm. These results challenge the conventional wisdom of machines as altruistic assistants or rational actors. Instead, our study highlights the importance of AI agents reflecting the nuanced complexity of human social behaviors -- imperfect yet driven by deeper social cognitive processes.

摘要

尽管技术发展迅速,但实现有效的人机协作仍面临重大挑战。人类与机器的合作意愿往往低于人际合作,这种现象被称为'机器惩罚'。本研究表明,在具备沟通机制的社会困境博弈中,基于大语言模型的人工智能代理能够克服这种惩罚效应。通过一项预注册实验(涉及1,152名参与者),我们部署了三种不同行为特征的AI代理:自私型、合作型和公平型。结果显示,只有公平型AI能激发与人类互动相当的合作水平。分析表明,公平型AI与人类参与者类似,虽会偶尔违背事前合作承诺,却能有效建立合作的社会规范。这些发现挑战了将机器视为纯粹利他助手或完全理性行为体的传统认知。相反,我们的研究强调AI代理需要体现人类社会行为的微妙复杂性——虽不完美却遵循深层社会认知机制。


Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures

Abstract

arXiv:2407.19580v4 Announce Type: replace-cross Abstract: Training with larger mini-batches improves the convergence rate and can yield superior performance. However, training with large mini-batches becomes prohibitive for Large Language Models (LLMs), due to the large GPU memory requirement. To address this problem, an effective approach is finding small mini-batch coresets that closely match the gradient of larger mini-batches. However, this approach becomes infeasible and ineffective for LLMs, due to the highly imbalanced mixture of sources in language data, use of the Adam optimizer, and the very large gradient dimensionality of LLMs. In this work, we address the above challenges by proposing Coresets for Training LLMs (CoLM). First, we show that mini-batch coresets found by gradient matching do not contain representative examples of the small sources with high probability, and thus including all examples of the small sources in the mini-batch coresets is crucial for optimal performance. Second, we normalize the gradients by their historical exponential to find mini-batch coresets for training with Adam. Finally, we leverage zeroth-order methods to find a smooth gradient of the last V-projection matrix and sparsify it to keep the dimensions with the largest normalized gradient magnitude. We apply CoLM to fine-tuning Phi-2, Phi-3, Zephyr, and Llama-3 models with LoRA on the MathInstruct and SuperGLUE benchmarks. Remarkably, CoLM reduces the memory requirement of fine-tuning by 2x and even outperforms training with 4x larger mini-batches. Moreover, CoLM seamlessly integrates with existing memory-efficient training methods like LoRA, further reducing the memory requirements of training LLMs. Our code is available at https://github.com/BigML-CS-UCLA/CoLM.

摘要

采用较大规模的小批量训练可提升收敛速度并获得更优性能。然而对于大语言模型(LLMs),由于GPU显存需求巨大,大规模小批量训练难以实现。为解决该问题,寻找能精准匹配大批量梯度的小批量核心集成为有效途径。但受语言数据中高度不平衡的混合来源、Adam优化器的使用以及LLMs超高维梯度的制约,该方法对LLMs存在可行性低、效果差的问题。本研究提出LLM训练核心集(CoLM)应对上述挑战:首先证明梯度匹配获得的小批量核心集大概率不包含小数据源的代表性样本,揭示将小数据源全部样本纳入核心集对最优性能的关键作用;其次通过历史指数归一化梯度实现Adam优化器下的核心集构建;最后采用零阶方法获取末V投影矩阵的平滑梯度并进行稀疏化处理,保留归一化梯度幅值最大的维度。在MathInstruct和SuperGLUE基准测试中,CoLM应用于Phi-2、Phi-3、Zephyr及Llama-3模型的LoRA微调,显著将微调显存需求降低2倍,且性能超越4倍批量训练。此外,CoLM可与LoRA等现有高效显存训练方法无缝集成,进一步降低LLMs训练资源需求。代码已开源:https://github.com/BigML-CS-UCLA/CoLM。


CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

Abstract

arXiv:2409.19291v3 Announce Type: replace-cross Abstract: Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence. However, recent studies discovered that CLIP can only encode one aspect of the feature space, leading to substantial information loss and indistinctive features. To mitigate this issue, this paper introduces a novel strategy that fine-tunes a series of complementary CLIP models and transforms them into a CLIP-MoE. Specifically, we propose a model-agnostic Diversified Multiplet Upcycling (DMU) framework for CLIP. Instead of training multiple CLIP models from scratch, DMU leverages a pre-trained CLIP and fine-tunes it into a diverse set with highly cost-effective multistage contrastive learning, thus capturing distinct feature subspaces efficiently. To fully exploit these fine-tuned models while minimizing computational overhead, we transform them into a CLIP-MoE, which dynamically activates a subset of CLIP experts, achieving an effective balance between model capacity and computational cost. Comprehensive experiments demonstrate the superior performance of CLIP-MoE across various zero-shot retrieval, zero-shot image classification tasks, and downstream Multimodal Large Language Model (MLLM) benchmarks when used as a vision encoder.

摘要

对比语言-图像预训练(CLIP)已成为多模态智能领域的基石。然而,近期研究发现CLIP仅能编码特征空间的单一维度,导致显著信息丢失与特征区分度不足。为缓解这一问题,本文提出一种创新策略:通过微调一系列互补的CLIP模型并将其转化为CLIP混合专家模型(CLIP-MoE)。具体而言,我们设计了模型无关的多样化多元升级框架(DMU)用于CLIP。该框架无需从头训练多个CLIP模型,而是基于预训练CLIP模型,通过高性价比的多阶段对比学习将其微调为具有多样性的模型集合,从而高效捕获不同特征子空间。为充分挖掘这些微调模型潜力并最小化计算开销,我们将其转化为CLIP-MoE模型,该模型动态激活CLIP专家子集,在模型容量与计算成本间实现有效平衡。综合实验表明,CLIP-MoE在零样本检索、零样本图像分类任务以及作为视觉编码器应用于下游多模态大语言模型(MLLM)基准测试时均展现出卓越性能。
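
The phrase "dynamically activates a subset of CLIP experts" refers to sparse mixture-of-experts routing. Below is a minimal, framework-free sketch of generic top-k gating, under the assumption that gate scores are given per input; where the router sits inside CLIP-MoE and how its experts are built by DMU are not specified in the abstract, and the names `top_k_route` and `moe_forward` are hypothetical.

```python
import math


def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, gate_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    weights = top_k_route(gate_logits, k)
    outputs = {i: experts[i](x) for i in weights}  # unselected experts are never called
    dim = len(next(iter(outputs.values())))
    return [sum(weights[i] * outputs[i][d] for i in weights) for d in range(dim)]


# Three toy "experts" that simply scale their input.
experts = [
    lambda x: [1.0 * v for v in x],
    lambda x: [2.0 * v for v in x],
    lambda x: [3.0 * v for v in x],
]
mixed = moe_forward([1.0, 2.0], experts, gate_logits=[0.0, 5.0, 5.0], k=2)
```

Since only k experts run per input, compute grows with k rather than with the total number of experts, which is the capacity versus cost trade-off the abstract refers to.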


Exploring the Limitations of Mamba in COPY and CoT Reasoning

Abstract

arXiv:2410.03810v2 Announce Type: replace-cross Abstract: Transformers have become the backbone of modern Large Language Models (LLMs); however, their inference overhead grows linearly with the sequence length, posing challenges for modeling long sequences. In light of this, Mamba has attracted attention for maintaining a constant inference size, with empirical evidence demonstrating that it can match Transformer performance in sequence modeling while significantly reducing computational costs. However, an open question remains: can Mamba always bring savings while achieving performance comparable to Transformers? In this paper, we focus on analyzing the expressive ability of Mamba to perform our defined COPY operation and Chain of Thought (CoT) reasoning. First, inspired by the connection between Mamba and linear attention, we show that constant-sized Mamba may struggle to perform COPY operations while Transformers can handle them more easily. However, when the size of Mamba grows linearly with the input sequence length, it can accurately perform COPY, but in this case, Mamba no longer provides overhead savings. Based on this observation, we further analyze Mamba's ability to tackle CoT tasks, which can be described by the Dynamic Programming (DP) problems. Our findings suggest that to solve arbitrary DP problems, the total cost of Mamba is still comparable to standard Transformers. However, similar to efficient Transformers, when facing DP problems with favorable properties such as locality, Mamba can provide savings in overhead. Our experiments on the copy and CoT tasks further demonstrate Mamba's limitations compared to Transformers in learning these tasks.

摘要

Transformer已成为现代大语言模型(LLM)的核心架构,但其推理开销随序列长度线性增长,这对长序列建模提出了挑战。鉴于此,Mamba因能保持恒定推理规模而受到关注,实证研究表明其在序列建模中能达到与Transformer相当的性能,同时显著降低计算成本。然而,一个悬而未决的问题是:Mamba是否总能实现与Transformer相当的性能并带来计算节省?本文重点分析了Mamba执行我们定义的COPY操作和思维链(CoT)推理的表达能力。首先,受Mamba与线性注意力之间关联的启发,我们发现恒定规模的Mamba可能难以执行COPY操作,而Transformer能更轻松地处理该任务。但当Mamba的规模随输入序列长度线性增长时,其可精确执行COPY操作,但此时Mamba不再提供开销节省。基于这一观察,我们进一步分析了Mamba处理可描述为动态规划(DP)问题的CoT任务的能力。研究结果表明,要解决任意DP问题,Mamba的总成本仍与标准Transformer相当。然而,与高效Transformer类似,当面对具有局部性等有利特性的DP问题时,Mamba可降低开销。我们在COPY和CoT任务上的实验进一步验证了Mamba在学习这些任务时相比Transformer的局限性。


The Stepwise Deception: Simulating the Evolution from True News to Fake News with LLM Agents

Abstract

arXiv:2410.19064v2 Announce Type: replace-cross Abstract: With the growing spread of misinformation online, understanding how true news evolves into fake news has become crucial for early detection and prevention. However, previous research has often assumed fake news inherently exists rather than exploring its gradual formation. To address this gap, we propose FUSE (Fake news evolUtion Simulation framEwork), a novel Large Language Model (LLM)-based simulation approach explicitly focusing on fake news evolution from real news. Our framework models a social network with four distinct types of LLM agents commonly observed in daily interactions: spreaders who propagate information, commentators who provide interpretations, verifiers who fact-check, and bystanders who observe passively. Together, these agents simulate realistic daily interactions that progressively distort true news. To quantify these gradual distortions, we develop FUSE-EVAL, a comprehensive evaluation framework measuring truth deviation along multiple linguistic and semantic dimensions. Experiments show that FUSE effectively captures fake news evolution patterns, accurately reproduces known fake news evolution scenarios, aligns closely with human judgment, and highlights the importance of timely intervention at early stages. Our framework is extensible, enabling future research on broader scenarios of fake news.

摘要

随着网络虚假信息的日益蔓延,理解真实新闻如何演变为虚假新闻对于早期检测和预防变得至关重要。然而,先前研究往往假设虚假新闻天然存在,而非探索其逐步形成过程。为填补这一空白,我们提出FUSE(虚假新闻演化模拟框架)——一种基于大语言模型(LLM)的新型模拟方法,专门研究真实新闻向虚假新闻的演变过程。该框架通过构建包含四类典型LLM代理的社交网络(日常互动中常见的传播者、评论者、验证者和旁观者),模拟逐步扭曲真实新闻的现实互动场景。为量化这种渐进式失真,我们开发了FUSE-EVAL评估框架,从多维度语言和语义特征测量真相偏离程度。实验表明,FUSE能有效捕捉虚假新闻演化规律,精确复现已知虚假新闻案例,并与人类评估结果高度一致。研究证实该框架能准确还原已知的虚假新闻演化场景,同时揭示早期及时干预的重要性。本框架具有良好的可扩展性,可为更广泛的虚假新闻研究场景提供支持。


Revisiting In-Context Learning with Long Context Language Models

Abstract

arXiv:2412.16926v3 Announce Type: replace-cross Abstract: In-Context Learning (ICL) is a technique by which language models make predictions based on examples provided in their input context. Previously, their context window size imposed a limit on the number of examples that can be shown, making example selection techniques crucial for identifying the maximally effective set of examples. However, the recent advent of Long Context Language Models (LCLMs) has significantly increased the number of examples that can be included in context, raising an important question of whether ICL performance in a many-shot regime is still sensitive to the method of sample selection. To answer this, we revisit these approaches in the context of LCLMs through extensive experiments on 18 datasets spanning 4 tasks. Surprisingly, we observe that sophisticated example selection techniques do not yield significant improvements over a simple random sample selection method. Instead, we discover that the advent of LCLMs has fundamentally shifted the challenge of ICL from that of selecting the most effective examples to that of collecting sufficient examples to fill the context window. Specifically, in certain datasets, including all available examples does not fully utilize the context window; however, by augmenting the examples in context with a simple data augmentation approach, we substantially improve ICL performance by 5%.

摘要

上下文学习(ICL)是一种语言模型基于输入上下文中提供的示例进行预测的技术。此前,其上下文窗口大小限制了可展示示例的数量,使得示例选择技术对于识别最有效示例集至关重要。然而,近期长上下文语言模型(LCLMs)的出现显著增加了上下文中可包含的示例数量,这引发了一个重要问题:在多示例机制下,ICL性能是否仍对样本选择方法敏感。为解答这一问题,我们通过在4个任务的18个数据集上进行广泛实验,重新审视了LCLMs背景下的这些方法。令人惊讶的是,我们发现复杂的示例选择技术并未比简单的随机样本选择方法带来显著改进。相反,我们发现LCLMs的出现从根本上将ICL的挑战从选择最有效示例转变为收集足够示例以填满上下文窗口。具体而言,在某些数据集中,包含所有可用示例并未充分利用上下文窗口;然而,通过采用简单的数据增强方法扩充上下文中的示例,我们将ICL性能显著提升了5%。
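
The random-selection baseline described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the whitespace token counter, the `Input:`/`Label:` prompt format, the toy pool, and the 200-token budget are all assumptions made for the example.

```python
import random

def count_tokens(text):
    # Crude whitespace count (assumption); a real setup would use the model's tokenizer.
    return len(text.split())

def format_shot(question, answer):
    # Hypothetical prompt format for one in-context example.
    return f"Input: {question}\nLabel: {answer}"

def fill_context(pool, budget_tokens, seed=0):
    """Shuffle the pool and keep every example that still fits in the token budget."""
    rng = random.Random(seed)
    shuffled = pool[:]
    rng.shuffle(shuffled)
    chosen, used = [], 0
    for question, answer in shuffled:
        cost = count_tokens(format_shot(question, answer))
        if used + cost > budget_tokens:
            continue  # this one does not fit; a shorter one later might
        chosen.append((question, answer))
        used += cost
    return chosen, used

def build_prompt(examples, query):
    shots = "\n\n".join(format_shot(q, a) for q, a in examples)
    return f"{shots}\n\nInput: {query}\nLabel:"

# Toy pool of 50 labelled examples, 8 whitespace tokens each once formatted.
pool = [(f"review number {i} was fine", "positive" if i % 2 else "negative")
        for i in range(50)]
examples, used = fill_context(pool, budget_tokens=200)
prompt = build_prompt(examples, "the plot was dull")
fill_ratio = used / 200  # below 1.0 would mean the pool cannot fill the window
```

When `fill_ratio` stays well below 1.0 even after every available example is included, the bottleneck is the size of the pool rather than the selection method, which is the situation the abstract addresses with data augmentation.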


Gender-Neutral Large Language Models for Medical Applications: Reducing Bias in PubMed Abstracts

Abstract

arXiv:2501.06365v2 Announce Type: replace-cross Abstract: This paper presents a pipeline for mitigating gender bias in large language models (LLMs) used in medical literature by neutralizing gendered occupational pronouns. A dataset of 379,000 PubMed abstracts from 1965-1980 was processed to identify and modify pronouns tied to professions. We developed a BERT-based model, "Modern Occupational Bias Elimination with Refined Training," or "MOBERT," trained on these neutralized abstracts, and compared its performance with "1965BERT," trained on the original dataset. MOBERT achieved a 70% inclusive replacement rate, while 1965BERT reached only 4%. A further analysis of MOBERT revealed that pronoun replacement accuracy correlated with the frequency of occupational terms in the training data. We propose expanding the dataset and refining the pipeline to improve performance and ensure more equitable language modeling in medical applications.

摘要

本文提出了一种通过中性化职业相关代词来减轻医学文献中使用的大型语言模型(LLMs)性别偏见的处理流程。我们对1965-1980年间379,000篇PubMed摘要进行处理,识别并修改与职业相关的代词。基于这些中性化摘要,我们开发了名为"现代职业偏见消除精细训练模型"(MOBERT)的BERT模型,并将其与原始数据集训练的"1965BERT"进行性能比较。MOBERT实现了70%的包容性替换率,而1965BERT仅达到4%。进一步分析表明,MOBERT的代词替换准确率与训练数据中职业术语的出现频率相关。我们建议通过扩展数据集和优化流程来提升性能,从而在医学应用中实现更公平的语言建模。


Redundancy Principles for MLLMs Benchmarks

Abstract

arXiv:2501.13953v2 Announce Type: replace-cross Abstract: With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. The rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back, critically assess the current state of redundancy, and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains. Through a comprehensive analysis of hundreds of MLLMs' performance across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively. The code is available at https://github.com/zzc-1998/Benchmark-Redundancy.

摘要

随着多模态大语言模型(MLLMs)的快速迭代和领域需求的不断演变,每年产生的基准测试数量已激增至数百个。这种快速增长不可避免地导致了基准测试间显著的冗余问题。因此,有必要退后一步,批判性地评估当前冗余现状,并提出构建有效MLLM基准测试的针对性原则。本文从三个关键视角聚焦冗余问题:1)基准测试能力维度的冗余,2)测试题目数量的冗余,3)特定领域内跨基准测试的冗余。通过对数百个MLLMs在20余个基准测试上表现的综合分析,我们旨在量化现有MLLM评估中的冗余程度,为未来MLLM基准测试的发展提供有价值的指导见解,并提出有效优化和解决冗余问题的策略。代码详见:https://github.com/zzc-1998/Benchmark-Redundancy。


PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models

Abstract

arXiv:2501.03124v4 Announce Type: replace-cross Abstract: Process-level Reward Models (PRMs) are crucial for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process. Since language models are prone to various types of errors during the reasoning process, PRMs are required to possess nuanced capabilities for detecting various implicit error types in real-world scenarios. However, current benchmarks primarily focus on step correctness, failing to evaluate PRMs' performance systematically. To address this gap, we introduce PRMBench, a process-level benchmark specifically designed to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions, including simplicity, soundness, and sensitivity. In our experiments on 15 models, spanning both open-source PRMs and closed-source large language models prompted as critic models, we uncover significant weaknesses in current PRMs. These findings underscore the challenges inherent in process-level evaluation and highlight key directions for future research. We hope PRMBench can be a robust bench for advancing research on PRM evaluation and development.

摘要

过程级奖励模型(PRMs)对于复杂推理和决策任务至关重要,其中每个中间步骤在推理过程中都起着重要作用。由于语言模型在推理过程中容易产生各类错误,PRMs需要具备在真实场景中检测各种隐式错误类型的细致能力。然而,当前基准测试主要关注步骤正确性,未能系统评估PRMs的性能。为填补这一空白,我们提出了PRMBench——一个专门用于评估PRMs细粒度错误检测能力的过程级基准。PRMBench包含6,216个精心设计的问题和83,456个步骤级标签,从简洁性、健全性和敏感性等多维度评估模型。在对15个模型(包括开源PRMs和作为评判模型的闭源大语言模型)的实验中,我们发现了当前PRMs存在显著缺陷。这些发现揭示了过程级评估固有的挑战,并指明了未来研究的关键方向。我们希望PRMBench能成为推进PRM评估与开发研究的可靠基准。


Controllable Context Sensitivity and the Knob Behind It

Abstract

arXiv:2411.07404v3 Announce Type: replace-cross Abstract: When making predictions, a language model must trade off how much it relies on its context vs. its prior knowledge. Choosing how sensitive the model is to its context is a fundamental functionality, as it enables the model to excel at tasks like retrieval-augmented generation and question-answering. In this paper, we search for a knob which controls this sensitivity, determining whether language models answer from the context or their prior knowledge. To guide this search, we design a task for controllable context sensitivity. In this task, we first feed the model a context (Paris is in England) and a question (Where is Paris?); we then instruct the model to either use its prior or contextual knowledge and evaluate whether it generates the correct answer for both intents (either France or England). When fine-tuned on this task, instruction-tuned versions of Llama-3.1, Mistral-v0.3, and Gemma-2 can solve it with high accuracy (85-95%). Analyzing these high-performing models, we narrow down which layers may be important to context sensitivity using a novel linear time algorithm. Then, in each model, we identify a 1-D subspace in a single layer that encodes whether the model follows context or prior knowledge. Interestingly, while we identify this subspace in a fine-tuned model, we find that the exact same subspace serves as an effective knob in not only that model but also non-fine-tuned instruct and base models of that model family. Finally, we show a strong correlation between a model's performance and how distinctly it separates context-agreeing from context-ignoring answers in this subspace. These results suggest a single subspace facilitates how the model chooses between context and prior knowledge, hinting at a simple fundamental mechanism that controls this behavior.

摘要

在进行预测时,语言模型必须在依赖上下文与先验知识之间做出权衡。选择模型对上下文的敏感程度是一项基本功能,这使其能够在检索增强生成和问答等任务中表现出色。本文旨在寻找控制这种敏感性的调节机制,以确定语言模型是基于上下文还是先验知识回答问题。为引导这一探索,我们设计了一项可控上下文敏感性的任务。在该任务中,我们首先向模型提供上下文(巴黎位于英格兰)和问题(巴黎在哪里?);随后指示模型使用其先验知识或上下文知识,并评估其是否能针对两种意图生成正确答案(法国或英格兰)。当Llama-3.1、Mistral-v0.3和Gemma-2的指令调优版本在此任务上进行微调后,能以高准确率(85-95%)解决问题。通过分析这些高性能模型,我们利用一种新颖的线性时间算法,缩小了可能对上下文敏感性至关重要的层范围。接着,在每个模型中,我们在单一层内识别出一个一维子空间,该子空间编码了模型遵循上下文还是先验知识的选择。有趣的是,尽管我们在微调模型中发现了这一子空间,但相同的子空间不仅在该模型中有效,还能作为未微调的指令模型和基础模型家族中的有效调节机制。最后,我们展示了模型性能与其在该子空间中将上下文一致答案与忽略上下文答案区分开的清晰程度之间存在强相关性。这些结果表明,单一子空间促进了模型在上下文与先验知识之间的选择,暗示了控制该行为的简单基础机制。
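
To make the idea of a one-dimensional "knob" concrete, the sketch below moves a hidden state along a single unit direction while leaving everything orthogonal to it untouched. It is only an illustration under stated assumptions: the direction vector, the 4-dimensional hidden state, the target values of +4.0 and -4.0, and the convention that the positive side means "follow the context" are all hypothetical, and the abstract does not say how the authors apply the intervention.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def coordinate(hidden, direction):
    """Scalar position of the hidden state along the unit direction."""
    return dot(hidden, direction)

def set_coordinate(hidden, direction, target):
    """Shift the hidden state along `direction` so its coordinate equals `target`.

    Only the component inside the 1-D subspace changes; the orthogonal part is kept.
    """
    shift = target - coordinate(hidden, direction)
    return [h + shift * d for h, d in zip(hidden, direction)]

def orthogonal_part(hidden, direction):
    c = coordinate(hidden, direction)
    return [h - c * d for h, d in zip(hidden, direction)]

# Hypothetical values: in practice the direction would be estimated from a
# layer's activations and the hidden state would come from a forward pass.
direction = normalize([1.0, -2.0, 0.5, 0.0])
hidden = [0.3, 0.1, -0.7, 1.2]

toward_context = set_coordinate(hidden, direction, 4.0)   # assumed "use context" side
toward_prior = set_coordinate(hidden, direction, -4.0)    # assumed "use prior" side
```

Because both edited states share the same orthogonal component, any change in the model's answer could be attributed to the single coordinate that was moved.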


Can Large Language Models Be Trusted as Evolutionary Optimizers for Network-Structured Combinatorial Problems?

Abstract

arXiv:2501.15081v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have shown impressive capabilities in language understanding and reasoning across diverse domains. Recently, there has been increasing interest in utilizing LLMs not merely as assistants in optimization tasks, but as active optimizers, particularly for network-structured combinatorial problems. However, before LLMs can be reliably deployed in this role, a fundamental question must be addressed: Can LLMs iteratively manipulate solutions that consistently adhere to problem constraints? In this work, we propose a systematic framework to evaluate the capacity of LLMs to engage with problem structures. Rather than treating the model as a black-box generator, we adopt commonly used evolutionary operators as the optimizer and propose a comprehensive evaluation framework that rigorously assesses the output fidelity of LLM-generated operators across different stages of the evolutionary process. To enhance robustness, we introduce a hybrid error-correction mechanism that mitigates uncertainty in LLM outputs. Moreover, we develop a cost-efficient population-level optimization strategy that significantly improves efficiency compared to traditional individual-level approaches. Extensive experiments on a representative node-level combinatorial network optimization task demonstrate the effectiveness, adaptability, and inherent limitations of LLM-based operators. Our findings offer new perspectives on the integration of LLMs into evolutionary computation, providing practical insights for scalable optimization in networked systems.

摘要

大型语言模型(LLMs)在跨领域语言理解与推理方面展现出卓越能力。近期研究趋势逐渐从将其作为优化任务辅助工具转向作为主动优化器,尤其针对网络结构组合优化问题。然而,在可靠部署LLMs担任此角色前,必须解决一个核心问题:LLMs能否迭代式操控解决方案并始终保持约束满足?本研究提出系统性评估框架,用于检验LLMs处理问题结构的能力。区别于将模型视为黑箱生成器,我们采用经典进化算子作为优化器,构建了全面评估体系,严格检验LLM生成算子在进化过程各阶段的输出保真度。为增强鲁棒性,我们设计了混合纠错机制以降低LLM输出的不确定性。此外,开发了成本优化的群体级优化策略,较传统个体级方法显著提升效率。在典型节点级组合网络优化任务上的大量实验,验证了基于LLM算子的有效性、适应性与固有局限性。本研究为LLMs与进化计算的融合提供了新视角,为网络系统可扩展优化实践提供了重要见解。


Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

Abstract

arXiv:2502.05242v2 Announce Type: replace-cross Abstract: Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making process remain unclear. Chain-of-thoughts (CoTs) have been commonly utilized to monitor LLMs, but this strategy fails to accurately reflect LLMs' thinking process. Techniques based on LLMs' hidden representations provide an inner perspective to monitor their latent thinking. However, previous methods only try to develop external monitors instead of making LLMs themselves easier to monitor. In this paper, we propose a novel method, TELLME, that improves the transparency of LLMs and helps monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase the applications of TELLME on trustworthiness tasks (e.g., safety risk monitoring tasks and detoxification tasks), where LLMs achieve consistent improvement in transparency and task performance. More crucially, we theoretically analyze the improvement of TELLME on LLMs' generalization ability through optimal transport theory.

摘要

大型语言模型(LLMs)的能力日益增强,但其思维与决策过程的机制仍不明确。思维链(CoTs)虽被广泛用于监测LLMs,但该策略无法准确反映模型的真实思考过程。基于LLMs隐层表征的技术为监测其潜在思维提供了内部视角,然而现有方法仅试图构建外部监测器,而非增强模型自身的可监测性。本文提出创新方法TELLME,通过提升LLMs的透明度帮助监测器识别不当及敏感行为。进一步地,我们在可信度任务(如安全风险监测任务与去毒化任务)上展示了TELLME的应用效果,LLMs在透明度与任务性能方面均获得持续提升。最关键的是,我们通过最优传输理论从理论上分析了TELLME对LLMs泛化能力的改进作用。


LoRA-One: One-Step Full Gradient Could Suffice for Fine-Tuning Large Language Models, Provably and Efficiently

Abstract

arXiv:2502.01235v2 Announce Type: replace-cross Abstract: This paper explores how theory can guide and enhance practical algorithms, using Low-Rank Adaptation (LoRA, Hu et al. 2022) in large language models as a case study. We rigorously prove that, under gradient descent, LoRA adapters align with specific singular subspaces of the one-step full fine-tuning gradient. This result suggests that, by properly initializing the adapters using the one-step full gradient, subspace alignment can be achieved immediately, and that this applies to both linear and nonlinear models. Building on our theory, we propose a theory-driven algorithm, LoRA-One, for which linear convergence (as well as generalization) is established and for which incorporating preconditioners theoretically helps mitigate the effects of ill-conditioning. Besides, our theory reveals connections between LoRA-One and other gradient-alignment-based methods, helping to clarify misconceptions in the design of such algorithms. LoRA-One achieves significant empirical improvements over LoRA and its variants across benchmarks in natural language understanding, mathematical reasoning, and code generation. Code is available at: https://github.com/YuanheZ/LoRA-One.

摘要

本文以大型语言模型中的低秩自适应(LoRA,Hu等人2022)为案例,探讨理论如何指导并提升实践算法。我们严格证明在梯度下降过程中,LoRA适配器会与一步全微调梯度的特定奇异子空间对齐。该结果表明,通过使用一步全梯度正确初始化适配器,可立即实现子空间对齐,且该方法适用于线性和非线性模型。基于理论分析,我们提出理论驱动的算法LoRA-One,该算法具备线性收敛性(及泛化性)的理论保证,并证明引入预条件子可有效缓解病态条件的影响。此外,我们的理论揭示了LoRA-One与其他基于梯度对齐方法的内在联系,有助于澄清此类算法设计中的误解。在自然语言理解、数学推理和代码生成等基准测试中,LoRA-One相较LoRA及其变体均取得显著性能提升。代码详见:https://github.com/YuanheZ/LoRA-One。


Non-Markovian Discrete Diffusion with Causal Language Models

Abstract

arXiv:2502.09767v2 Announce Type: replace-cross Abstract: Discrete diffusion models offer a flexible, controllable approach to structured sequence generation, yet they still lag behind causal language models in expressive power. A key limitation lies in their reliance on the Markovian assumption, which restricts each step to condition only on the current state, leading to potential uncorrectable error accumulation. In this paper, we introduce CaDDi, a discrete diffusion model that conditions on the entire generative trajectory, thereby lifting the Markov constraint and allowing the model to revisit and improve past states. By unifying sequential (causal) and temporal (diffusion) reasoning in a single non-Markovian transformer, CaDDi also treats standard causal language models as a special case and permits the direct reuse of pretrained LLM weights with no architectural changes. Empirically, CaDDi outperforms state-of-the-art discrete diffusion baselines on natural-language benchmarks, substantially narrowing the remaining gap to large autoregressive transformers.

摘要

离散扩散模型为结构化序列生成提供了灵活可控的方法,但其表达能力仍落后于因果语言模型。关键局限在于其对马尔可夫假设的依赖,该假设要求每步生成仅能基于当前状态,从而导致可能无法修正的误差累积。本文提出CaDDi模型,这是一种基于完整生成轨迹的离散扩散模型,通过解除马尔可夫约束使模型能够回溯并优化历史状态。通过将序列(因果)推理与时间(扩散)推理统一于单一非马尔可夫Transformer架构中,CaDDi不仅将标准因果语言模型视为特例,还能直接复用预训练大语言模型权重而无需架构修改。实验表明,CaDDi在自然语言基准测试中优于最先进的离散扩散基线模型,显著缩小了与大型自回归Transformer的性能差距。


LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models

Abstract

arXiv:2502.02406v3 Announce Type: replace-cross Abstract: Cross-attention is commonly adopted in multimodal large language models (MLLMs) for integrating visual information into the language backbone. However, in applications with large visual inputs, such as video understanding, processing a large number of visual tokens in cross-attention layers leads to high memory demands and often necessitates distributed computation across multiple GPUs. Existing distributed attention mechanisms face significant communication overheads, making cross-attention layers a critical bottleneck for efficient training and inference of MLLMs. To address this, we propose LV-XAttn, a distributed, exact cross-attention mechanism with minimal communication overhead. We observe that in applications involving large visual inputs, the size of the query block is typically much smaller than that of the key-value blocks. Thus, in LV-XAttn we keep the large key-value blocks locally on each GPU and exchange smaller query blocks across GPUs. We also introduce an efficient activation recomputation technique to support longer visual context. We theoretically analyze the communication benefits of LV-XAttn and show that it can achieve speedups for a wide range of models. Our evaluations with Llama 3-V, mPLUG-Owl3 and OpenFlamingo models find that LV-XAttn achieves up to 10.62× end-to-end speedup compared to existing approaches.

摘要

交叉注意力机制在多模态大语言模型(MLLMs)中常被用于将视觉信息整合到语言主干网络。然而,在视频理解等需要处理大规模视觉输入的应用中,交叉注意力层处理大量视觉标记会导致高内存需求,通常需要跨多个GPU进行分布式计算。现有分布式注意力机制面临显著的通信开销,使得交叉注意力层成为MLLMs高效训练与推理的关键瓶颈。为此,我们提出LV-XAttn——一种通信开销极小的分布式精确交叉注意力机制。通过观察发现,在涉及大规模视觉输入的应用中,查询块的大小通常远小于键值块。因此,LV-XAttn将大尺寸键值块保留在各GPU本地,仅跨GPU交换小尺寸查询块。我们还引入了一种高效的激活重计算技术以支持更长视觉上下文。理论分析表明LV-XAttn具有通信优势,可适用于多种模型的加速。基于Llama 3-V、mPLUG-Owl3和OpenFlamingo模型的实验表明,相比现有方法,LV-XAttn能实现最高10.62倍的端到端加速比。
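
The sketch below illustrates why attention can stay exact when the key-value blocks never move: each shard produces a partial result for the visiting query, and the partial results are merged with running softmax statistics. This is a single-process toy under stated assumptions, not the LV-XAttn implementation: the loop over shards stands in for GPUs, the max/sum-of-exponentials merge is a standard technique that the abstract does not spell out, and all sizes and names are hypothetical.

```python
import math
import random

def scores_for(q, keys):
    return [sum(a * b for a, b in zip(q, k)) for k in keys]

def attend_full(q, keys, values):
    """Reference: softmax attention of one query over all keys at once."""
    s = scores_for(q, keys)
    m = max(s)
    w = [math.exp(x - m) for x in s]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z
            for j in range(len(values[0]))]

def attend_partial(q, keys, values):
    """Work done where one key-value shard lives: (max score, exp sum, weighted values)."""
    s = scores_for(q, keys)
    m = max(s)
    w = [math.exp(x - m) for x in s]
    acc = [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(len(values[0]))]
    return m, sum(w), acc

def merge(a, b):
    """Combine two partial results by rescaling both to a shared maximum."""
    (ma, za, acca), (mb, zb, accb) = a, b
    m = max(ma, mb)
    sa, sb = math.exp(ma - m), math.exp(mb - m)
    return m, za * sa + zb * sb, [x * sa + y * sb for x, y in zip(acca, accb)]

def attend_sharded(q, key_shards, value_shards):
    """The small query visits each shard in turn; key-value shards stay put."""
    state = None
    for ks, vs in zip(key_shards, value_shards):
        part = attend_partial(q, ks, vs)
        state = part if state is None else merge(state, part)
    _, z, acc = state
    return [x / z for x in acc]

rng = random.Random(0)
dim, num_keys, shard_size = 4, 8, 2
q = [rng.uniform(-1, 1) for _ in range(dim)]
keys = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_keys)]
values = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_keys)]
key_shards = [keys[i:i + shard_size] for i in range(0, num_keys, shard_size)]
value_shards = [values[i:i + shard_size] for i in range(0, num_keys, shard_size)]

full = attend_full(q, keys, values)
sharded = attend_sharded(q, key_shards, value_shards)
```

In a distributed run only the query block and the three running quantities would travel between devices, which is cheap when queries are far fewer than visual key-value tokens.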


Robust LLM Alignment via Distributionally Robust Direct Preference Optimization

Abstract

arXiv:2502.01930v2 Announce Type: replace-cross Abstract: A major challenge in aligning large language models (LLMs) with human preferences is the issue of distribution shift. LLM alignment algorithms rely on static preference datasets, assuming that they accurately represent real-world user preferences. However, user preferences vary significantly across geographical regions, demographics, linguistic patterns, and evolving cultural trends. This preference distribution shift leads to catastrophic alignment failures in many real-world applications. We address this problem using the principled framework of distributionally robust optimization, and develop two novel distributionally robust direct preference optimization (DPO) algorithms, namely, Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO). We characterize the sample complexity of learning the optimal policy parameters for WDPO and KLDPO. Moreover, we propose scalable gradient descent-style learning algorithms by developing suitable approximations for the challenging minimax loss functions of WDPO and KLDPO. Our empirical experiments using benchmark data sets and LLMs demonstrate the superior performance of WDPO and KLDPO in substantially improving the alignment when there is a preference distribution shift.

摘要

将大型语言模型(LLM)与人类偏好对齐的一个主要挑战是分布偏移问题。LLM对齐算法依赖于静态偏好数据集,假设它们能准确反映现实世界中的用户偏好。然而,用户偏好在地理区域、人口统计、语言模式以及不断变化的文化趋势等方面存在显著差异。这种偏好分布偏移会导致许多实际应用中出现灾难性的对齐失败。我们利用分布鲁棒优化的原则性框架来解决这一问题,并开发了两种新颖的分布鲁棒直接偏好优化(DPO)算法,即Wasserstein DPO(WDPO)和Kullback-Leibler DPO(KLDPO)。我们刻画了学习WDPO和KLDPO最优策略参数的样本复杂度。此外,针对WDPO和KLDPO中具有挑战性的极小极大损失函数,我们通过开发合适的近似方法,提出了可扩展的梯度下降式学习算法。基于基准数据集和LLM的实证实验表明,当存在偏好分布偏移时,WDPO和KLDPO在显著提升对齐效果方面表现出优越性能。
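
For orientation, the general shape of the objective can be written down as follows. This is a sketch in assumed notation, not the paper's exact formulation: $\hat{P}$ (the empirical preference distribution), $\rho$ (the radius of the uncertainty set), and $\lambda$ (the dual variable) are symbols introduced here. The first line is the standard DPO loss, the second is the generic distributionally robust objective, and the third is the well-known dual form of the worst case over a KL ball, which turns the inner maximization into a one-dimensional minimization.

```latex
\begin{align}
\ell(\theta; x, y_w, y_l)
  &= -\log \sigma\!\left(
       \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
     - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
     \right), \\
\min_{\theta} \;
  &\sup_{Q \,:\, D(Q, \hat{P}) \le \rho} \;
     \mathbb{E}_{(x, y_w, y_l) \sim Q}\big[\ell(\theta; x, y_w, y_l)\big], \\
\sup_{Q \,:\, \mathrm{KL}(Q \,\|\, \hat{P}) \le \rho} \mathbb{E}_{Q}[\ell]
  &= \inf_{\lambda > 0} \Big\{ \lambda \rho
     + \lambda \log \mathbb{E}_{\hat{P}}\big[\exp(\ell / \lambda)\big] \Big\}.
\end{align}
```

Here $D$ would be a Wasserstein distance for WDPO and the KL divergence for KLDPO; the approximations the authors actually use to make the minimax loss tractable may differ from this textbook dual.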


Advancing Reasoning in Large Language Models: Promising Methods and Approaches

Abstract

arXiv:2502.03671v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have succeeded remarkably in various natural language processing (NLP) tasks, yet their reasoning capabilities remain a fundamental challenge. While LLMs exhibit impressive fluency and factual recall, their ability to perform complex reasoning (spanning logical deduction, mathematical problem-solving, commonsense inference, and multi-step reasoning) often falls short of human expectations. This survey provides a comprehensive review of emerging techniques enhancing reasoning in LLMs. We categorize existing methods into key approaches, including prompting strategies (e.g., Chain-of-Thought reasoning, Self-Consistency, and Tree-of-Thought reasoning), architectural innovations (e.g., retrieval-augmented models, modular reasoning networks, and neuro-symbolic integration), and learning paradigms (e.g., fine-tuning with reasoning-specific datasets, reinforcement learning, and self-supervised reasoning objectives). Additionally, we explore evaluation frameworks used to assess reasoning in LLMs and highlight open challenges, such as hallucinations, robustness, and reasoning generalization across diverse tasks. By synthesizing recent advancements, this survey aims to provide insights into promising directions for future research and practical applications of reasoning-augmented LLMs.

摘要

大语言模型(LLMs)在各种自然语言处理(NLP)任务中取得了显著成功,但其推理能力仍是一个根本性挑战。尽管LLMs展现出令人印象深刻的流畅性和事实记忆能力,但在复杂推理(包括逻辑演绎、数学问题求解、常识推理和多步推理)方面的表现往往不及人类预期。本综述全面回顾了增强LLMs推理能力的新兴技术,将现有方法归类为以下主要方向:提示策略(如思维链推理、自洽性和思维树推理)、架构创新(如检索增强模型、模块化推理网络和神经符号集成)以及学习范式(如基于推理专用数据集的微调、强化学习和自监督推理目标)。此外,我们探讨了用于评估LLMs推理能力的框架,并指出了开放性挑战,如幻觉问题、鲁棒性以及跨多样化任务的推理泛化能力。通过综合近期研究进展,本综述旨在为推理增强型LLMs的未来研究和实际应用提供前瞻性方向。


How Do LLMs Perform Two-Hop Reasoning in Context?

Abstract

arXiv:2502.13913v2 Announce Type: replace-cross Abstract: "Socrates is human. All humans are mortal. Therefore, Socrates is mortal." This form of argument illustrates a typical pattern of two-hop reasoning. Formally, two-hop reasoning refers to the process of inferring a conclusion by making two logical steps, each connecting adjacent concepts, such that the final conclusion depends on the integration of both steps. It is one of the most fundamental components of human reasoning and plays a crucial role in both formal logic and everyday decision-making. Despite recent progress in large language models (LLMs), we surprisingly find that they can fail at solving simple two-hop reasoning problems when distractors are present. We observe on a synthetic dataset that pre-trained LLMs often resort to random guessing among all plausible conclusions. However, after a few steps of fine-tuning, models achieve near-perfect accuracy and exhibit strong length generalization. To understand the underlying mechanisms, we train a 3-layer Transformer from scratch on a synthetic two-hop reasoning task and reverse-engineer its internal information flow. We observe a clear progression in the attention logits throughout training. This depicts a sharp phase transition from an initial stage of random guessing to the emergence of a structured sequential query mechanism, where the model first retrieves the preceding and the bridge concepts in the early layers and then uses them to infer the final answer. Finally, we show that these dynamics can be captured by a minimal three-parameter attention-only network.
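
A synthetic two-hop problem with distractors, of the kind the abstract describes, might be generated as in the sketch below. The entity names (`E0`, `E1`, ...), the "`A is B.`" sentence template, and the number of chains are assumptions made for illustration; the authors' dataset format is not given in the abstract.

```python
import random

def make_problem(num_chains=3, seed=0):
    """Build one target chain (source -> bridge -> end) plus distractor chains."""
    rng = random.Random(seed)
    names = [f"E{i}" for i in range(3 * num_chains)]
    rng.shuffle(names)
    chains = [tuple(names[3 * i:3 * i + 3]) for i in range(num_chains)]
    facts = []
    for source, bridge, end in chains:
        facts.append((source, bridge))  # first hop
        facts.append((bridge, end))     # second hop
    rng.shuffle(facts)                  # premises appear in random order
    target_source, _, target_end = chains[0]
    context = " ".join(f"{a} is {b}." for a, b in facts)
    prompt = f"{context} Therefore, {target_source} is"
    distractor_ends = [chain[2] for chain in chains[1:]]
    return prompt, facts, target_source, target_end, distractor_ends

def solve_two_hop(facts, source):
    """Ground-truth solver: follow the two links starting from `source`."""
    step = dict(facts)
    return step[step[source]]

prompt, facts, source, answer, distractor_ends = make_problem()
```

A model that guesses uniformly among the plausible conclusions (the true end plus the distractor ends) would be right one time in `num_chains`, which gives the random-guessing baseline for this toy setup.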


ReLearn: Unlearning via Learning for Large Language Models

Abstract

arXiv:2502.11190v3 Announce Type: replace-cross Abstract: Current unlearning methods for large language models usually rely on reverse optimization to reduce target token probabilities. However, this paradigm disrupts the prediction of subsequent tokens, degrading model performance and linguistic coherence. Moreover, existing evaluation metrics overemphasize contextual forgetting while inadequately assessing response fluency and relevance. To address these challenges, we propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning, along with a comprehensive evaluation framework. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation, and Linguistic Score (LS) to evaluate generation quality. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output. Through mechanistic analysis, we further demonstrate how reverse optimization disrupts coherent text generation, while ReLearn preserves this essential capability. Code is available at https://github.com/zjunlp/unlearn.

摘要

当前针对大型语言模型的遗忘方法通常依赖反向优化来降低目标标记概率。然而这种范式会干扰后续标记预测,损害模型性能与语言连贯性。现有评估指标过度强调上下文遗忘,却未能充分评估回答流畅度与相关性。为解决这些问题,我们提出ReLearn——一个结合数据增强与微调的高效遗忘框架,并构建了综合评估体系。该框架引入知识遗忘率(KFR)与知识保留率(KRR)衡量知识层面的遗忘效果,通过语言评分(LS)评估生成质量。实验表明ReLearn在实现目标遗忘的同时能保持高质量输出。通过机制分析,我们进一步揭示反向优化如何破坏连贯文本生成能力,而ReLearn则能保留这一核心能力。代码详见https://github.com/zjunlp/unlearn。


ThinkGuard: Deliberative Slow Thinking Leads to Cautious Guardrails

Abstract

arXiv:2502.13458v2 Announce Type: replace-cross Abstract: Ensuring the safety of large language models (LLMs) is critical as they are deployed in real-world applications. Existing guardrails rely on rule-based filtering or single-pass classification, limiting their ability to handle nuanced safety violations. To address this, we propose ThinkGuard, a critique-augmented guardrail model that distills knowledge from high-capacity LLMs by generating structured critiques alongside safety labels. Fine-tuned on critique-augmented data, the captured deliberative thinking ability drastically enhances the guardrail's cautiousness and interpretability. Evaluated on multiple safety benchmarks, ThinkGuard achieves the highest average F1 and AUPRC, outperforming all baselines. Compared to LLaMA Guard 3, ThinkGuard improves accuracy by 16.1% and macro F1 by 27.0%. Moreover, it surpasses label-only fine-tuned models, confirming that structured critiques enhance both classification precision and nuanced safety reasoning while maintaining computational efficiency.

摘要

确保大型语言模型(LLMs)的安全性对于其在实际应用中的部署至关重要。现有的防护机制依赖于基于规则的过滤或单次分类,限制了其处理微妙安全违规的能力。为解决这一问题,我们提出了ThinkGuard,一种基于评论增强的防护模型,通过生成结构化评论与安全标签相结合,从高容量LLMs中提炼知识。在评论增强数据上进行微调后,所捕获的审慎思考能力显著提升了防护机制的谨慎性和可解释性。在多个安全基准测试中,ThinkGuard实现了最高的平均F1值和AUPRC,优于所有基线模型。与LLaMA Guard 3相比,ThinkGuard的准确率提高了16.1%,宏观F1值提升了27.0%。此外,其表现超越了仅基于标签微调的模型,证实结构化评论不仅能提升分类精度,还能增强对复杂安全问题的推理能力,同时保持计算效率。


MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections

Abstract

arXiv:2502.12170v2 Announce Type: replace-cross Abstract: We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches with static and shared connection weights, MUDD generates connection weights dynamically depending on hidden states at each sequence position and for each decoupled input stream (the query, key, value or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, achieving the performance of Transformers trained with 1.8× to 2.4× the compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining perplexity and downstream tasks and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code in JAX and PyTorch and pre-trained models are available at https://github.com/Caiyun-AI/MUDDFormer.

摘要

我们提出MUltiway Dynamic Dense(MUDD)连接方法,这是一种简单而有效的解决方案,旨在克服残差连接的局限性并增强Transformer中的跨层信息流。与现有采用静态共享连接权重的密集连接方法不同,MUDD能根据每个序列位置的隐藏状态以及Transformer模块各解耦输入流(查询、键、值或残差)动态生成连接权重。MUDD连接可无缝集成到任何Transformer架构中,形成MUDDFormer。大量实验表明,在语言建模任务中,MUDDFormer在不同模型架构和规模下均显著优于传统Transformer,其性能相当于使用1.8-2.4倍计算资源训练的Transformer。值得注意的是,MUDDPythia-2.8B在预训练困惑度和下游任务中匹配Pythia-6.9B的表现,甚至在五样本设置中媲美Pythia-12B,而仅增加0.23%的参数和0.4%的计算量。JAX和PyTorch实现代码及预训练模型已发布于https://github.com/Caiyun-AI/MUDDFormer。


CoSER: Coordinating LLM-Based Persona Simulation of Established Roles

Abstract

arXiv:2502.09082v2 Announce Type: replace-cross Abstract: Role-playing language agents (RPLAs) have emerged as promising applications of large language models (LLMs). However, simulating established characters presents a challenging task for RPLAs, due to the lack of authentic character datasets and nuanced evaluation methods using such data. In this paper, we present CoSER, a collection of a high-quality dataset, open models, and an evaluation protocol towards effective RPLAs of established characters. The CoSER dataset covers 17,966 characters from 771 renowned books. It provides authentic dialogues with real-world intricacies, as well as diverse data types such as conversation setups, character experiences and internal thoughts. Drawing from acting methodology, we introduce given-circumstance acting for training and evaluating role-playing LLMs, where LLMs sequentially portray multiple characters in book scenes. Using our dataset, we develop CoSER 8B and CoSER 70B, i.e., advanced open role-playing LLMs built on LLaMA-3.1 models. Extensive experiments demonstrate the value of the CoSER dataset for RPLA training, evaluation and retrieval. Moreover, CoSER 70B exhibits state-of-the-art performance surpassing or matching GPT-4o on our evaluation and three existing benchmarks, for example achieving 75.80% and 93.47% accuracy on the InCharacter and LifeChoice benchmarks respectively.

摘要

角色扮演语言代理(RPLAs)已成为大语言模型(LLMs)的重要应用方向。然而,由于缺乏真实角色数据集及基于此类数据的精细化评估方法,模拟经典角色对RPLAs而言仍具挑战性。本文提出CoSER框架,包含高质量数据集、开源模型及针对经典角色RPLAs的评估方案。CoSER数据集涵盖771部名著中的17,966个角色,提供具有现实复杂度的真实对话,以及对话场景设置、角色经历与内心独白等多种数据类型。借鉴表演方法论,我们提出"给定情境表演"训练评估范式,使LLMs能在书中场景里顺序演绎多重角色。基于该数据集,我们开发了基于LLaMA-3.1模型的先进开源角色扮演LLMs:CoSER 8B与CoSER 70B。大量实验证明CoSER数据集在RPLA训练、评估与检索中的价值。CoSER 70B展现出最先进的性能,在我们的评估及三个现有基准测试中超越或比肩GPT-4o,例如在InCharacter和LifeChoice基准上分别达到75.80%和93.47%的准确率。


Revisiting Weak-to-Strong Generalization in Theory and Practice: Reverse KL vs. Forward KL

Abstract

arXiv:2502.11107v3 Announce Type: replace-cross Abstract: As large language models advance toward superhuman performance, ensuring their alignment with human values and abilities grows increasingly complex. Weak-to-strong generalization offers a promising approach by leveraging predictions from weaker models to guide stronger systems, but its effectiveness could be constrained by the inherent noise and inaccuracies in these weak predictions. To address this, we propose a theoretically grounded approach that replaces forward KL divergence (whose mass-covering behavior risks overfitting to imperfect weak signals) with reverse KL divergence. Reverse KL divergence's zero-forcing effect prioritizes high-confidence predictions, effectively mitigating the influence of unreliable weak supervision. Theoretically, we extend existing bounds and derive tighter lower bounds for both forward and reverse KL divergence, establishing that reverse KL achieves at least comparable guarantees to forward KL. Notably, when a sufficiently pre-trained strong model is fine-tuned on the last linear layer, reverse KL guarantees that it outperforms its weak supervisor by the magnitude of their disagreement. Empirically, we demonstrate that reverse KL and reverse cross-entropy enable strong models to successfully outperform those trained with forward KL and standard cross-entropy across most settings, highlighting the practical advantages of these reverse losses.

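Summary

As large language models approach superhuman performance, keeping them aligned with human values and abilities becomes increasingly complex. Weak-to-strong generalization is a promising approach that uses the predictions of weaker models to guide stronger systems, but its effectiveness can be limited by the noise and inaccuracies inherent in those weak predictions. To address this, we propose a theoretically grounded approach that replaces forward KL divergence, whose mass-covering behavior risks overfitting to imperfect weak signals, with reverse KL divergence. The zero-forcing effect of reverse KL prioritizes high-confidence predictions and so mitigates the influence of unreliable weak supervision. Theoretically, we extend existing bounds and derive tighter lower bounds for both forward and reverse KL divergence, showing that reverse KL achieves guarantees at least comparable to those of forward KL. Notably, when a sufficiently pre-trained strong model is fine-tuned on its last linear layer, reverse KL guarantees that it outperforms its weak supervisor by the magnitude of their disagreement. Empirically, strong models trained with reverse KL and reverse cross-entropy outperform those trained with forward KL and standard cross-entropy in most settings, which highlights the practical advantages of these reverse losses.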
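
The contrast between mass-covering and zero-forcing behavior can be seen on a toy example. The sketch below is illustrative only: the distributions are hypothetical, and it assumes that forward KL means KL(weak || strong) and reverse KL means KL(strong || weak), which the abstract does not spell out. It is not the authors' training code.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical weak-supervisor output, torn between class 0 and class 2.
weak = [0.495, 0.01, 0.495]

# Two hypothetical strong-model predictions.
broad = [0.34, 0.32, 0.34]     # spreads mass over every class
one_mode = [0.90, 0.05, 0.05]  # commits to a single high-confidence class

for name, strong in [("broad", broad), ("one_mode", one_mode)]:
    forward = kl(weak, strong)  # assumed forward KL: KL(weak || strong)
    reverse = kl(strong, weak)  # assumed reverse KL: KL(strong || weak)
    print(f"{name:>8}: forward={forward:.3f} reverse={reverse:.3f}")
```

Under these assumptions, forward KL assigns the lower loss to the broad prediction, while reverse KL assigns the lower loss to the confident single-mode prediction.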


Self-Taught Agentic Long Context Understanding

Abstract

arXiv:2502.15920v2 Announce Type: replace-cross Abstract: Answering complex, long-context questions remains a major challenge for large language models (LLMs) as it requires effective question clarifications and context retrieval. We propose Agentic Long-Context Understanding (AgenticLU), a framework designed to enhance an LLM's understanding of such queries by integrating targeted self-clarification with contextual grounding within an agentic workflow. At the core of AgenticLU is Chain-of-Clarifications (CoC), where models refine their understanding through self-generated clarification questions and corresponding contextual groundings. By scaling inference as a tree search where each node represents a CoC step, we achieve 97.8% answer recall on NarrativeQA with a search depth of up to three and a branching factor of eight. To amortize the high cost of this search process to training, we leverage the preference pairs for each step obtained by the CoC workflow and perform two-stage model finetuning: (1) supervised finetuning to learn effective decomposition strategies, and (2) direct preference optimization to enhance reasoning quality. This enables AgenticLU models to generate clarifications and retrieve relevant context effectively and efficiently in a single inference pass. Extensive experiments across seven long-context tasks demonstrate that AgenticLU significantly outperforms state-of-the-art prompting methods and specialized long-context LLMs, achieving robust multi-hop reasoning while sustaining consistent performance as context length grows.

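Summary

Answering complex, long-context questions remains a major challenge for large language models (LLMs), because it requires effective question clarification and context retrieval. We propose Agentic Long-Context Understanding (AgenticLU), a framework designed to improve an LLM's understanding of such queries by combining targeted self-clarification with contextual grounding inside an agentic workflow. At the core of AgenticLU is Chain-of-Clarifications (CoC), in which the model refines its understanding through self-generated clarification questions and the corresponding contextual groundings. By scaling inference as a tree search in which each node represents a CoC step, we achieve 97.8% answer recall on NarrativeQA with a search depth of up to three and a branching factor of eight. To amortize the high cost of this search into training, we use the per-step preference pairs produced by the CoC workflow for two-stage fine-tuning: (1) supervised fine-tuning to learn effective decomposition strategies, and (2) direct preference optimization to improve reasoning quality. As a result, AgenticLU models generate clarifications and retrieve relevant context effectively and efficiently in a single inference pass. Extensive experiments on seven long-context tasks show that AgenticLU significantly outperforms state-of-the-art prompting methods and specialized long-context LLMs, achieving robust multi-hop reasoning while maintaining consistent performance as context length grows.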
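
To see why the search cost is worth amortizing into training, it helps to count the CoC steps in a full tree. The sketch below assumes every node is expanded with the full branching factor and no pruning, which is an upper bound and not necessarily what the authors do; the function name `coc_tree_nodes` is hypothetical.

```python
def coc_tree_nodes(branching: int, depth: int) -> int:
    """Number of non-root nodes in a full tree with the given branching factor and depth."""
    return sum(branching ** level for level in range(1, depth + 1))

# Branching factor of eight and depth of three, as quoted in the abstract.
print(coc_tree_nodes(branching=8, depth=3))  # 8 + 64 + 512 = 584
```

Under that assumption, a single question can require up to 584 CoC steps at search time, compared with one inference pass for the fine-tuned model.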


TS-RAG: Retrieval-Augmented Generation based Time Series Foundation Models are Stronger Zero-Shot Forecaster

Abstract

arXiv:2503.07649v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) and Foundation Models (FMs) have recently become prevalent for time series forecasting tasks. While fine-tuning LLMs enables domain adaptation, they often struggle to generalize across diverse and unseen datasets. Moreover, existing Time Series Foundation Models (TSFMs) still face challenges in handling non-stationary dynamics and distribution shifts, largely due to the lack of effective mechanisms for adaptation. To this end, we present TS-RAG, a retrieval-augmented generation framework for time series forecasting that enhances the generalization and interpretability of TSFMs. Specifically, TS-RAG leverages pre-trained time series encoders to retrieve semantically relevant segments from a dedicated knowledge base, enriching the contextual representation of the input query. Furthermore, we propose an Adaptive Retrieval Mixer (ARM) module that dynamically fuses the retrieved patterns with the TSFM's internal representation, improving forecasting accuracy without requiring task-specific fine-tuning. Thorough empirical studies on seven public benchmark datasets demonstrate that TS-RAG achieves state-of-the-art zero-shot forecasting performance, outperforming the existing TSFMs by up to 6.84% across diverse domains while also providing desirable interpretability.

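Summary

Large language models (LLMs) and foundation models (FMs) have recently become widely used for time series forecasting. Although fine-tuning LLMs enables domain adaptation, these models often struggle to generalize to diverse, unseen datasets. Existing time series foundation models (TSFMs) also still face challenges with non-stationary dynamics and distribution shifts, largely because they lack effective adaptation mechanisms. To address this, we present TS-RAG, a retrieval-augmented generation framework for time series forecasting that improves the generalization and interpretability of TSFMs. Specifically, TS-RAG uses pre-trained time series encoders to retrieve semantically relevant segments from a dedicated knowledge base, enriching the contextual representation of the input query. We also propose an Adaptive Retrieval Mixer (ARM) module that dynamically fuses the retrieved patterns with the TSFM's internal representation, improving forecasting accuracy without task-specific fine-tuning. Thorough experiments on seven public benchmark datasets show that TS-RAG achieves state-of-the-art zero-shot forecasting performance, outperforming existing TSFMs by up to 6.84% across diverse domains while also providing desirable interpretability.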


When Trust Collides: Decoding Human-LLM Cooperation Dynamics through the Prisoner's Dilemma

Abstract

arXiv:2503.07320v2 Announce Type: replace-cross Abstract: As large language models (LLMs) become increasingly capable of autonomous decision-making, they introduce new challenges and opportunities for human-AI cooperation in mixed-motive contexts. While prior research has primarily examined AI in assistive or cooperative roles, little is known about how humans interact with AI agents perceived as independent and strategic actors. This study investigates human cooperative attitudes and behaviors toward LLM agents by engaging 30 participants (15 males, 15 females) in repeated Prisoner's Dilemma games with agents differing in declared identity: purported human, rule-based AI, and LLM agent. Behavioral metrics, including cooperation rate, decision latency, unsolicited cooperative acts and trust restoration tolerance, were analyzed to assess the influence of agent identity and participant gender. Results revealed significant effects of declared agent identity on most cooperation-related behaviors, along with notable gender differences in decision latency. Furthermore, qualitative responses suggest that these behavioral differences were shaped by participants' interpretations and expectations of the agents. These findings contribute to our understanding of human adaptation in competitive cooperation with autonomous agents and underscore the importance of agent framing in shaping effective and ethical human-AI interaction.

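Summary

As large language models (LLMs) become increasingly capable of autonomous decision-making, they bring new challenges and opportunities for human-AI cooperation in mixed-motive contexts. Prior research has focused mainly on AI in assistive or cooperative roles, and little is known about how humans interact with AI agents that they perceive as independent, strategic actors. This study investigates human cooperative attitudes and behaviors toward LLM agents by having 30 participants (15 male, 15 female) play repeated Prisoner's Dilemma games against agents with different declared identities: purported human, rule-based AI, and LLM agent. Behavioral metrics, including cooperation rate, decision latency, unsolicited cooperative acts, and trust restoration tolerance, were analyzed to assess the influence of agent identity and participant gender. The results show that declared agent identity significantly affected most cooperation-related behaviors, and that decision latency differed notably by gender. Qualitative responses further suggest that these behavioral differences were shaped by participants' interpretations and expectations of the agents. These findings add to our understanding of how humans adapt in competitive cooperation with autonomous agents and underscore the importance of agent framing for effective and ethical human-AI interaction.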


SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement

Abstract

arXiv:2504.03561v2 Announce Type: replace-cross Abstract: In the interaction between agents and their environments, agents expand their capabilities by planning and executing actions. However, LLM-based agents face substantial challenges when deployed in novel environments or required to navigate unconventional action spaces. To empower agents to autonomously explore environments, optimize workflows, and enhance their understanding of actions, we propose SynWorld, a framework that allows agents to synthesize possible scenarios with multi-step action invocation within the action space and perform Monte Carlo Tree Search (MCTS) exploration to effectively refine their action knowledge in the current environment. Our experiments demonstrate that SynWorld is an effective and general approach to learning action knowledge in new environments. Code is available at https://github.com/zjunlp/SynWorld.

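Summary

In the interaction between agents and their environments, agents expand their capabilities by planning and executing actions. However, LLM-based agents face substantial challenges when deployed in novel environments or when required to navigate unconventional action spaces. To enable agents to autonomously explore environments, optimize workflows, and deepen their understanding of actions, we propose SynWorld. This framework lets agents synthesize possible scenarios involving multi-step action invocation within the action space, and then run Monte Carlo Tree Search (MCTS) exploration to refine their action knowledge in the current environment. Experiments show that SynWorld is an effective and general approach to learning action knowledge in new environments. Code is available at https://github.com/zjunlp/SynWorld.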


Experience Retrieval-Augmentation with Electronic Health Records Enables Accurate Discharge QA

Abstract

arXiv:2503.17933v2 Announce Type: replace-cross Abstract: To improve the reliability of Large Language Models (LLMs) in clinical applications, retrieval-augmented generation (RAG) is extensively applied to provide factual medical knowledge. However, beyond general medical knowledge from open-ended datasets, clinical case-based knowledge is also critical for effective medical reasoning, as it provides context grounded in real-world patient experiences. Motivated by this, we propose the Experience Retrieval-Augmentation (ExpRAG) framework based on Electronic Health Records (EHR), aiming to offer the relevant context from other patients' discharge reports. ExpRAG performs retrieval through a coarse-to-fine process, utilizing an EHR-based report ranker to efficiently identify similar patients, followed by an experience retriever to extract task-relevant content for enhanced medical reasoning. To evaluate ExpRAG, we introduce DischargeQA, a clinical QA dataset with 1,280 discharge-related questions across diagnosis, medication, and instruction tasks. Each problem is generated using EHR data to ensure realistic and challenging scenarios. Experimental results demonstrate that ExpRAG consistently outperforms a text-based ranker, achieving an average relative improvement of 5.2%, highlighting the importance of case-based knowledge for medical reasoning.

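Summary

To improve the reliability of large language models (LLMs) in clinical applications, retrieval-augmented generation (RAG) is widely used to supply factual medical knowledge. Beyond the general medical knowledge found in open-ended datasets, however, clinical case-based knowledge is also critical for effective medical reasoning, because it provides context grounded in real patient experiences. Motivated by this, we propose ExpRAG, an Experience Retrieval-Augmentation framework based on Electronic Health Records (EHR) that supplies relevant context from other patients' discharge reports. The framework retrieves in a coarse-to-fine process: an EHR-based report ranker first identifies similar patients efficiently, and an experience retriever then extracts task-relevant content to support medical reasoning. To evaluate ExpRAG, we introduce DischargeQA, a clinical QA dataset with 1,280 discharge-related questions spanning diagnosis, medication, and instruction tasks. Each question is generated from EHR data to ensure realistic and challenging scenarios. Experiments show that ExpRAG consistently outperforms a text-based ranker, with an average relative improvement of 5.2%, which highlights the importance of case-based knowledge for medical reasoning.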


Token embeddings violate the manifold hypothesis

Abstract

arXiv:2504.01002v2 Announce Type: replace-cross Abstract: A full understanding of the behavior of a large language model (LLM) requires our understanding of its input token space. If this space differs from our assumptions, our understanding of and conclusions about the LLM will likely be flawed. We elucidate the structure of the token embeddings both empirically and theoretically. We present a novel statistical test assuming that the neighborhood around each token has a relatively flat and smooth structure as the null hypothesis. Failing to reject the null is uninformative, but rejecting it at a specific token ψ implies an irregularity in the token subspace in a ψ-neighborhood, B(ψ). The structure assumed in the null is a generalization of a manifold with boundary called a smooth fiber bundle (which can be split into two spatial regimes: small and large radius), so we denote our new hypothesis test as the "fiber bundle hypothesis." Failure to reject the null hypothesis is uninformative, but rejecting it at ψ indicates a statistically significant irregularity at B(ψ). By running our test over several open-source LLMs, each with unique token embeddings, we find that the null is frequently rejected, and so the evidence suggests that the token subspace is not a fiber bundle and hence also not a manifold. As a consequence of our findings, when an LLM is presented with two semantically equivalent prompts, if one prompt contains a token implicated by our test, the response to that prompt will likely exhibit less stability than the other.

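Summary

Fully understanding the behavior of a large language model (LLM) requires understanding the structure of its input token space. If that space differs from our assumptions, our understanding of and conclusions about the LLM will likely be flawed. This paper elucidates the structure of token embeddings both empirically and theoretically. We present a novel statistical test whose null hypothesis is that the neighborhood around each token has a relatively flat and smooth structure. Failing to reject the null is uninformative, but rejecting it at a specific token ψ implies an irregularity in the token subspace within the ψ-neighborhood B(ψ). The structure assumed under the null is a generalization of a manifold with boundary, called a smooth fiber bundle, which can be split into two spatial regimes (small and large radius), so we call the new hypothesis test the "fiber bundle hypothesis." Failure to reject the null hypothesis is uninformative, but rejecting it at ψ indicates a statistically significant irregularity at B(ψ). Running the test on several open-source LLMs, each with its own token embeddings, we find that the null is frequently rejected, so the evidence suggests that the token subspace is not a fiber bundle and hence not a manifold either. As a consequence, when an LLM is given two semantically equivalent prompts and one of them contains a token implicated by our test, the response to that prompt will likely be less stable than the response to the other.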


Layers at Similar Depths Generate Similar Activations Across LLM Architectures

Abstract

arXiv:2504.08775v2 Announce Type: replace-cross Abstract: How do the latent spaces used by independently-trained LLMs relate to one another? We study the nearest neighbor relationships induced by activations at different layers of 24 open-weight LLMs, and find that they 1) tend to vary from layer to layer within a model, and 2) are approximately shared between corresponding layers of different models. Claim 2 shows that these nearest neighbor relationships are not arbitrary, as they are shared across models, but Claim 1 shows that they are not "obvious" either, as there is no single set of nearest neighbor relationships that is universally shared. Together, these suggest that LLMs generate a progression of activation geometries from layer to layer, but that this entire progression is largely shared between models, stretched and squeezed to fit into different architectures.

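Summary

How do the latent spaces used by independently trained LLMs relate to one another? We study the nearest neighbor relationships induced by activations at different layers of 24 open-weight LLMs and find that they 1) tend to vary from layer to layer within a model, and 2) are approximately shared between corresponding layers of different models. Claim 2 shows that these nearest neighbor relationships are not arbitrary, since they are shared across models. Claim 1 shows that they are not "obvious" either, since no single set of nearest neighbor relationships is universally shared. Together, these findings suggest that LLMs generate a progression of activation geometries from layer to layer, and that this entire progression is largely shared between models, stretched and squeezed to fit different architectures.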
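
One way to make "shared nearest neighbor relationships" concrete is to compute, for the same set of inputs, each input's k nearest neighbors under two different activation matrices and measure how much the neighbor sets agree. The sketch below is an assumption-laden illustration: the Euclidean distance, the Jaccard overlap, the function names, and the toy activations are all hypothetical choices, and the abstract does not specify the authors' actual similarity measure.

```python
import math

def knn_sets(acts, k):
    """For each activation vector, return the index set of its k nearest other vectors."""
    neighbors = []
    for i, a in enumerate(acts):
        ranked = sorted((math.dist(a, b), j) for j, b in enumerate(acts) if j != i)
        neighbors.append({j for _, j in ranked[:k]})
    return neighbors

def knn_overlap(acts_a, acts_b, k):
    """Mean Jaccard overlap between k-NN sets from two layers, over the same inputs."""
    sets_a, sets_b = knn_sets(acts_a, k), knn_sets(acts_b, k)
    scores = [len(s & t) / len(s | t) for s, t in zip(sets_a, sets_b)]
    return sum(scores) / len(scores)

# Toy activations for five inputs (hypothetical values).
layer_a = [[0, 0], [1, 0], [0, 2], [5, 5], [6, 5]]
# Same geometry, scaled and embedded in a wider space: neighbors are preserved.
layer_b = [[0, 0, 0], [3, 0, 0], [0, 6, 0], [15, 15, 0], [18, 15, 0]]
# A different geometry: neighbors change.
layer_c = [[0, 0], [10, 0], [0, 1], [5, 5], [0.5, 0.2]]

print(knn_overlap(layer_a, layer_b, k=1))
print(knn_overlap(layer_a, layer_c, k=1))
```

Because the comparison uses only neighbor indices, the two layers may have different widths, which is what allows layers from different architectures to be compared at all.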